This section covers model options configuration, model building execution and model file information.
On “Model options” window, you can configure the model parameters to make a better model. There are 3 tabs – “Normal”, “Binary model”, “Regression model”, on which you configure options to build different types of model.
Normal
“Intelligent impute”: Intelligently assign a value to a missing value.
“Resampling”: Resample data when only one model is selected.
“Number of samples”: The number of samples needed; default is 5.
“Best number of sample combinations”: Select the best model or the best combination of models. 0 means selecting the most effective model among the combinations; >0 means selecting the fixed top N models; and <0 means selecting the one among the top N models that makes the most effective combination. Default is 3.
“Balanced sampling ratio”: The proportion of the number of samples in the minority class to the number of samples in the majority class; default is 1:1.
“Sample multiplier”: The option decides the amount of sample data: Number of variables*Sample multiplier=Sample data; default is 150.
“Ensemble method”: “Optimal model strategy” and “Simple model combination”. The former selects best top N models to build a new model and involves a relatively large amount of computations. The latter just combines all defined models to build a new model and involves a relatively small amount of computations.
“Best number of ensembles”: Select the best model combination. 0 means selecting the most effective model among the combinations; >0 means selecting the fixed top N models; and <0 means selecting the one among the top N models that makes the most effective combination. Default is 0.
“Ensemble function”: The option specifies the approach of combining models. You can select any function included in numpy. Default is np.mean.
“Model evaluation criterion”: Use AUC statistics for a binary model and MSE for a regression model.
“Percentage of test data”: Test data percentage.
“Adjust scoring results”: By default the model scoring results will be adjusted according to the average of the sample data. Without adjustment the score is the average of the balanced samples.
“Set random seeds”: You can control randomness of model building through this option; default is 0. If the value is null, two model building executions will get random results respectively. If the value is set as integer n, two executions will get same results if their ns are same; and different results if their ns are different. When the random seed is set as n, the random_state value of all models will be set as the same value and can’t be manually changed.
Binary model
You can configure parameters for binary models on “Binary model” tab. A selected binary model will be used for model building.
There are 7 types of binary model: TreeClassification, GBDTClassification, RFClassification, LogicClassification, RidgeClassification, FNNClassification and XGBClassification
“Number of samples” determines the number of samples used to build a model.
Below are parameter configuration directions for binary models.
Appendix 1: Binary model parameters
Type & Range: The type is always followed by an interval indicating the parameter’s value range. Square brackets indicate a closed interval and parentheses indicate an opened interval. Braces are used for certain int and float parameters to represent drop-down-menu values; the format is {start value, end value, interval}, such as {1, 5, 1}=[1,2,3,4,5] and {1, 6, 2}=[1,3,5]. All available values of string parameters will be listed in the drop-down menu and can’t be entered manually; a null value should be selected through the drop-down menu; the bool values, which are true and false, also should be selected in the drop-down menu; values of int parameters and float parameters can be enterned manually and all available values will be listed in drop-down menu.
For some float parameters, if their values are integers, they need to be followed by .0, like 0.0 and 1.0.
TreeClassification
Parameter |
Type & Range |
Description |
criterion |
string: ["gini", "entropy"] |
Criterions for evaluating node splitting. |
splitter |
string: ["best", "random"] |
Choose a splitting strategy for each node. |
max_depth |
int: [1, +∞), {1, 100, 1} null |
Maximum tree depth. |
min_samples_split |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount sampling data for node splitting. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_samples_leaf |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount of sampling data for leaf node. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_weight_fraction_leaf |
float: [0, 1), {0, 0.1, 0.01} |
Minimum weight among all weights of the input sampling data at a leaf node. |
max_features |
int: [1, +∞), {10, 1000, 10} float: (0, 1] string: ["auto", "sqrt", "log2"] null |
Get the maximum number of variables for the optimal node splitting. Max number of variables for an int parameter. Max proportion of variables for a float parameter. If "auto", then max_features=sqrt(n_features). If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If null, then max_features=n_features. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
max_leaf_nodes |
int: [1, +∞), {10, 1000, 10} null |
Use best-first fashion to generate the largest number of leaf nodes in a pruned tree. null means that there’s no limitation on the number of leaf nodes. |
min_impurity_decrease |
float: [0, 1) |
The lowest impurity decrease for node splitting. |
class_weight |
string: [“balanced”] null |
Weights associated with classes in the form {class_label: weight}. |
presort |
bool |
Whether to presort data to speed up training. |
GBDTClassification
Parameter |
Type |
Description |
loss |
string: ["deviance", "exponential"] |
A loss function. |
learning_rate |
float: (0, 1), {0.1, 0.9, 0.1} |
The learning rate, which is in direct ratio to the training speed. But it’s probably that there isn’t an optimal solution. |
n_estimators |
int: [1, +∞), {10, 500, 10} |
Number of boosting stages. |
subsample |
float: (0, 1], {0.1, 1, 0.1} |
The ratio of sampling data used by a basic machine learning method. |
criterion |
string: ["mse", "friedman_mse", "mae"] |
Criterions for evaluating node splitting. |
min_samples_split |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount sampling data for node splitting. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_samples_leaf |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount of sampling data for leaf node. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_weight_fraction_leaf |
float: [0, 1), {0, 0.1, 0.01} |
Minimum weight among all weights of the input sampling data at a leaf node. |
max_depth |
int: [1, +∞), {1, 100, 1} null |
Maximum tree depth. |
min_impurity_decrease |
float: [0, 1) |
The lowest impurity decrease for node splitting. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
max_features |
int: [1, +∞), {10, 1000, 10} float: (0, 1] string: ["auto", "sqrt", "log2"] null |
Get the maximum number of variables for the optimal node splitting. Max number of variables for an int parameter. Max proportion of variables for a float parameter. If "auto", then max_features=sqrt(n_features). If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If null, then max_features=n_features. |
max_leaf_nodes |
int: [1, +∞), {10, 1000, 10} null |
Use best-first fashion to generate the largest number of leaf nodes in a pruned tree. null means that there’s no limitation on the number of leaf nodes. |
warm_start |
bool |
Use the result of the previous iteration if the value is true; and won’t use that if the value is false. |
presort |
string: [“auto”] bool |
Whether to presort data. |
RFClassification
Parameter |
Type |
Description |
n_estimators |
int: [1, +∞), {10, 500, 10} |
The number of trees. |
criterion |
string: ["gini", "entropy"] |
Criterions for evaluating node splitting. |
max_depth |
int: [1, +∞), {1, 100, 1} null |
Maximum tree depth. |
min_samples_split |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount sampling data for node splitting. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_samples_leaf |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount of sampling data for leaf node. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_weight_fraction_leaf |
float: [0, 1), {0, 0.1, 0.01} |
Minimum weight among all weights of the input sampling data at a leaf node. |
max_features |
int: [1, +∞), {10, 1000, 10} float: (0, 1] string: ["auto", "sqrt", "log2"] null |
Get the maximum number of variables for the optimal node splitting. Max number of variables for an int parameter. Max proportion of variables for a float parameter. If "auto", then max_features=sqrt(n_features). If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If null, then max_features=n_features. |
max_leaf_nodes |
int: [1, +∞), {10, 1000, 10} null |
Use best-first fashion to generate the largest number of leaf nodes in a pruned tree. null means that there’s no limitation on the number of leaf nodes. |
min_impurity_decrease |
float: [0, 1) |
The lowest impurity decrease for node splitting. |
bootstrap |
bool |
Whether to use bootstrap when generating a tree. |
oob_score |
bool |
Whether to use out-of-bag samples to predict accuracy. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
warm_start |
bool |
Use the result of the previous iteration if the value is true; and won’t use that if the value is false. |
class_weight |
string: [“balanced”] null |
Weights associated with classes in the form {class_label: weight}. |
LogicClassification
Parameter |
Type |
Description |
penalty |
string: ["l1", "l2", "elasticnet", "none"] |
Penalty regularization. "newton-cg", "sag" and "lbfgs" solvers support "l2" only; "elasticnet" supports ‘saga’ solver only; "none" means non-regularization and doesn’t support liblinear solver. |
dual |
bool |
Dual or primal formulation. Dual formulation is only implemented for l2 penalty with liblinear solver. Prefer dual=False when n_samples > n_features. |
tol |
float: (0, 1) |
The tolerance value before stopping iteration. |
C |
float: (0, 1] |
Inverse of regularization strength, which must be a positive. |
fit_intercept |
bool |
Whether to include an intercept item. |
intercept_scaling |
float: (0, 1] |
Only works when solver="liblinear". |
class_weight |
string: [“balanced”] null |
Weights associated with classes in the form {class_label: weight}. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
solver |
string: ["newton-cg", "lbfgs", "liblinear", "sag", "saga"] |
The optimal algorithm. |
max_iter |
int: [1, +∞), {10, 500, 10} |
Maximum iterations; only works when solver=["newton-cg", "lbfgs", "sag"]. |
multi_class |
string: ["ovr", "multinomial", "auto"] |
The algorithm for handling multiple classes. "ovr" builds a model for each class; "multinomial" doesn’t work with solver="liblinear". |
warm_start |
bool |
Use the result of the previous iteration if the value is true; and won’t use that if the value is false. |
RidgeClassification
Parameter |
Type |
Description |
alpha |
float: [0, 1], {0.0, 1.0, 0.1} |
Regularization strength, which must be a positive. |
fit_intercept |
bool |
Whether to include an intercept item. |
normalize |
bool |
Whether to normalize data. |
max_iter |
int: [1, +∞), {10, 500, 10} null |
Maximum iterations. |
tol |
float: (0, 1) |
Precision of the final solution. |
class_weight |
string: [“balanced”] null |
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. "balanced" for auto-adjust. |
solver |
string: ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"] |
The optimal algorithm. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
FNNClassification
Parameters for this type of binary model are unable to be opened for the time being due to some special features of the neural networks.
XGBClassification
Parameter |
Type |
Description |
max_depth |
int: [1, +∞), {1, 100, 1} |
Maximum tree depth. |
learning_rate |
float: (0, 1), {0.1, 0.9, 0.1} |
The learning rate, which is in direct ratio to the training speed. But it’s probably that there isn’t an optimal solution. |
n_estimators |
int: [1, +∞), {10, 500, 10} |
Number of booster trees. |
objective |
string: ["binary:logistic", "binary:logitraw", "binary:hinge"] |
Learning objective; binary:logistic: Binary logistic regression for outputting probability; binary:logitraw: Binary logistic regression for outputting the score before logistic transformation; binary:hinge: Binary hinge loss for outputting class 0 or class 1 instead of the probability. |
booster |
string: ["gbtree", "gblinear", "dart"] |
The booster type used. |
gamma |
float: [0, +∞) |
The smallest loss mitigation value for node splitting. |
min_child_weight |
int: [1, +∞), {10, 1000, 10} |
The minimum sum of sampling weights of child nodes. |
max_delta_step |
int: [0, +∞), {0, 10, 1} |
The allowed longest delta step for evaluating a tree’s weight. |
subsample |
float: (0, 1], {0.1, 1.0, 0.1} |
The proportion of subsample for training a model to the whole set of samplings. |
colsample_bytree |
float: (0, 1], {0.1, 1.0, 0.1} |
Proportion of the random sampling from the features for each tree. |
colsample_bylevel |
float: (0, 1], {0.1, 1.0, 0.1} |
Proportion of random sampling from the features on each horizontal level for node splitting. |
reg_alpha |
float: [0, 1], {0.0, 1.0, 0.1} |
L1 regularization term. |
reg_lambda |
float: [0, 1], {0.0, 1.0, 0.1} |
L2 regularization term. |
scale_pos_weight |
float: (0, +∞) |
Control the balance of positive samples and negative samples. |
base_score |
float: (0, 1), {0.1, 0.9, 0.1} |
The initial value for starting a prediction. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
missing |
float: (-∞, +∞) null |
Define a missing value. |
Regression model
You can configure parameters for regression models on “Regression model” tab. A selected regression model will be used for model building.
There are 9 types of regression models – TreeRegression, GBDTRegression, RFRegression, LRegression, LassoRegression, ENRegression, RidgeRegression, FNNRegression and XGBRegression.
“Number of samples” determines the number of samples used to build a model.
Below are parameter configuration directions for regression models.
Appendix 2: Regression model parameters
Type & Range: The type is always followed by an interval indicating the parameter’s value range. Square brackets indicate a closed interval and parentheses indicate an opened interval. Braces are used for certain int and float parameters to represent drop-down-menu values; the format is {start value, end value, interval}, such as {1, 5, 1}=[1,2,3,4,5] and {1, 6, 2}=[1,3,5]. All available values of string parameters will be listed in the drop-down menu and can’t be entered manually; a null value should be selected through the drop-down menu; the bool values, which are true and false, also should be selected in the drop-down menu; values of int parameters and float parameters can be enterned manually and all available values will be listed in drop-down menu.
For some float parameters, if their values are integers, they need to be followed by .0, like 0.0 and 1.0.
TreeRegression
Parameter |
Type |
Description |
criterion |
string: ["mse", "friedman_mse", "mae"] |
Criterions for evaluating node splitting. |
splitter |
string: ["best", "random"] |
Choose a splitting strategy for each node. |
max_depth |
int: [1, +∞), {1, 100, 1} null |
Maximum tree depth. |
min_samples_split |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount sampling data for node splitting. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_samples_leaf |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount of sampling data for leaf node. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_weight_fraction_leaf |
float: [0, 1), {0, 0.1, 0.01} |
Minimum weight among all weights of the input sampling data at a leaf node. |
max_features |
int: [1, +∞), {10, 1000, 10} float: (0, 1] string: ["auto", "sqrt", "log2"] null |
Get the maximum number of variables for the optimal node splitting. Max number of variables for an int parameter. Max proportion of variables for a float parameter. If "auto", then max_features=sqrt(n_features). If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If null, then max_features=n_features. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
max_leaf_nodes |
int: [1, +∞), {10, 1000, 10} null |
Use best-first fashion to generate the largest number of leaf nodes in a pruned tree. null means that there’s no limitation on the number of leaf nodes. |
min_impurity_decrease |
float: [0, 1) |
The lowest impurity decrease for node splitting. |
presort |
bool |
Whether to presort data to speed up training. |
GBDTRegression
Parameter |
Type |
Description |
loss |
string: ["ls", "lad", "huber", "quantile"] |
A loss function. |
learning_rate |
float: (0, 1), {0.1, 0.9, 0.1} |
The learning rate, which is in direct ratio to the training speed. But it’s probably that there isn’t an optimal solution. |
n_estimators |
int: [1, +∞), {10, 500, 10} |
Number of boosting stages. |
subsample |
float: (0, 1], {0.1, 1, 0.1} |
The ratio of sampling data used by a basic machine learning method. |
criterion |
string: ["mse", "friedman_mse", "mae"] |
Criterions for evaluating node splitting. |
min_samples_split |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount sampling data for node splitting. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_samples_leaf |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount of sampling data for leaf node. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_weight_fraction_leaf |
float: [0, 1), {0, 0.1, 0.01} |
Minimum weight among all weights of the input sampling data at a leaf node. |
max_depth |
int: [1, +∞), {1, 100, 1} null |
Maximum tree depth. |
min_impurity_decrease |
float: [0, 1) |
The lowest impurity decrease for node splitting. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
max_features |
int: [1, +∞), {10, 1000, 10} float: (0, 1] string: ["auto", "sqrt", "log2"] null |
Get the maximum number of variables for the optimal node splitting. Max number of variables for an int parameter. Max proportion of variables for a float parameter. If "auto", then max_features=sqrt(n_features). If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If null, then max_features=n_features. |
alpha |
float: (0, 1), {0.1, 0.9, 0.1} |
The alpha-quantile of the huber loss function and the quantile loss function. Only if loss='huber' or loss='quantile' |
max_leaf_nodes |
int: [1, +∞), {10, 1000, 10} null |
Use best-first fashion to generate the largest number of leaf nodes in a pruned tree. null means that there’s no limitation on the number of leaf nodes. |
warm_start |
bool |
Use the result of the previous iteration if the value is true; and won’t use that if the value is false. |
presort |
string: [“auto”] bool |
Whether to presort data. |
RFRegression
Parameter |
Type |
Description |
n_estimators |
int: [1, +∞), {10, 500, 10} |
The number of trees. |
criterion |
string: ["mse", "mae"] |
Criterions for evaluating node splitting. |
max_depth |
int: [1, +∞), {1, 100, 1} null |
Maximum tree depth. |
min_samples_split |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount sampling data for node splitting. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_samples_leaf |
int: [1, +∞), {10, 1000, 10} float: (0, 1) |
Minimum amount of sampling data for leaf node. int is the min amount of sampling data and float is the proportion of it to the whole data. |
min_weight_fraction_leaf |
float: [0, 1), {0, 0.1, 0.01} |
Minimum weight among all weights of the input sampling data at a leaf node. |
max_features |
int: [1, +∞), {10, 1000, 10} float: (0, 1] string: ["auto", "sqrt", "log2"] null |
Get the maximum number of variables for the optimal node splitting. Max number of variables for an int parameter. Max proportion of variables for a float parameter. If "auto", then max_features=sqrt(n_features). If "sqrt", then max_features=sqrt(n_features). If "log2", then max_features=log2(n_features). If null, then max_features=n_features. |
max_leaf_nodes |
int: [1, +∞), {10, 1000, 10} null |
Use best-first fashion to generate the largest number of leaf nodes in a pruned tree. null means that there’s no limitation on the number of leaf nodes. |
min_impurity_decrease |
float: [0, 1) |
The lowest impurity decrease for node splitting. |
bootstrap |
bool |
Whether to use bootstrap when generating a tree. |
oob_score |
bool |
Whether to use out-of-bag samples to predict accuracy. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
warm_start |
bool |
Use the result of the previous iteration if the value is true; and won’t use that if the value is false. |
LRegression
Parameter |
Type |
Description |
fit_intercept |
bool |
Whether to include an intercept item. |
normalize |
bool |
Whether to normalize data. |
LassoRegression
Parameter |
Type |
Description |
alpha |
float: [0, 1], {0.0, 1.0, 0.1} |
A constant multiplied by L1 item. |
fit_intercept |
bool |
Whether to include an intercept item. |
normalize |
bool |
Whether to normalize data. |
precompute |
string: ["auto"] bool |
Whether to precompute Gram matrix to speed up model building. |
max_iter |
int: [1, +∞), {10, 500, 10} |
Maximum iterations. |
tol |
float: (0, 1) |
The tolerance value before stopping iteration. |
warm_start |
bool |
Use the result of the previous iteration if the value is true; and won’t use that if the value is false. |
positive |
bool |
Whether to rule that a coefficient must be positive. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
selection |
string: ["cyclic", "random"] |
"cyclic" means iteration by loop by variables; "random" represents a random iteration coefficient. |
ENRegression
Parameter |
Type |
Description |
alpha |
float: [0, 1], {0.0, 1.0, 0.1} |
A constant multiplied by penalty item. |
l1_ratio |
float: [0, 1], {0.0, 1.0, 0.1} |
A mixed parameter; its value is L2 when l1_ratio=0; and its value is L1 when l1_ratio=1; its value is mixed proportion when 11_ratio falls between 0 and 1. |
fit_intercept |
bool |
Whether to include an intercept item. |
normalize |
bool |
Whether to normalize data. |
precompute |
bool |
Whether to precompute Gram matrix to speed up model building. |
max_iter |
int: [1, +∞), {10, 500, 10} |
Maximum iterations. |
tol |
float: (0, 1) |
The tolerance value before stopping iteration. |
warm_start |
bool |
Use the result of the previous iteration if the value is true; and won’t use that if the value is false. |
positive |
bool |
Whether to rule that a coefficient must be positive. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
selection |
string: ["cyclic", "random"] |
"cyclic" means iteration by loop by variables; "random" represents a random iteration coefficient. |
RidgeRegression
Parameter |
Type |
Description |
alpha |
float: [0, 1], {0.0, 1.0, 0.1} |
Regularization strength, which must be a positive. |
fit_intercept |
bool |
Whether to include an intercept item. |
normalize |
bool |
Whether to normalize data. |
max_iter |
int: [1, +∞), {10, 500, 10} null |
Maximum iterations. |
tol |
float: (0, 1) |
Precision of the final solution. |
solver |
string: ["auto", "svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"] |
The optimal algorithm. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
FNNRegression
Parameters for this type of regression model are unable to be opened for the time being due to some special features of the neural networks.
XGBRegression
Parameter |
Type |
Description |
max_depth |
int: [1, +∞), {1, 100, 1} |
Maximum tree depth. |
learning_rate |
float: (0, 1), {0.1, 0.9, 0.1} |
The learning rate, which is in direct ratio to the training speed. But it’s probably that there isn’t an optimal solution. |
n_estimators |
int: [1, +∞), {10, 500, 10} |
Number of booster trees. |
objective |
string: ["reg:squarederror", "reg:squaredlogerror", "reg:logistic"] |
Learning objective; reg:squarederror: Squared error loss; reg:squaredlogerror: Squared log error loss; reg:logistic: logistic regression. |
booster |
string: ["gbtree", "gblinear", "dart"] |
The booster type used. |
gamma |
float: [0, +∞) |
The smallest loss mitigation value for node splitting. |
min_child_weight |
int: [1, +∞), {10, 1000, 10} |
The minimum sum of sampling weights of child nodes. |
max_delta_step |
int: [0, +∞), {0, 10, 1} |
The allowed longest delta step for evaluating a tree’s weight. |
subsample |
float: (0, 1], {0.1, 1.0, 0.1} |
The proportion of subsample for training a model to the whole set of samplings. |
colsample_bytree |
float: (0, 1], {0.1, 1.0, 0.1} |
Proportion of the random sampling from the features for each tree. |
colsample_bylevel |
float: (0, 1], {0.1, 1.0, 0.1} |
Proportion of random sampling from the features on each horizontal level for node splitting. |
reg_alpha |
float: [0, 1], {0.0, 1.0, 0.1} |
L1 regularization term. |
reg_lambda |
float: [0, 1], {0.0, 1.0, 0.1} |
L2 regularization term. |
scale_pos_weight |
float: (0, +∞) |
Control the balance of positive samples and negative samples. |
base_score |
float: (0, 1), {0.1, 0.9, 0.1} |
The initial value for starting a prediction. |
random_state |
int: (-∞, +∞), {0, 20, 1} null |
Random seed; null means the absolute randomness. |
missing |
float: (-∞, +∞) null |
Define a missing value. |
To build a predictive model, you must choose a target variable and then select a modeling table file through “Model file”. By default the modeling table file is stored under the same directory where the loaded data is stored and has the same name as the loaded data file. Users can define the path and name themselves. A model document is one with .pcf extension.
Click “Modeling” option under “Run”, or click on the toolbar, to pop up the “Build model” window, where model building information is output.
Model building is finished when the message “Model building is finished” appears.
“Importance” will be displayed after a model is built, as shown below. A variable’s degree of importance indicates its influence on the future predictive result. The higher the degree of importance is, the bigger a variable’s influence is. A variable with zero importance degree has no impact on the predictive result. As the following shows, the Sex variable has the biggest influence on the result.
Note: If the data source location is “Remote server” when loading data, the model building will be executed on a remote server.
Model presentation
YiMing AI Model encapsulates multiple algorithms for model building, they are: Decision Tree, Gradient Boosting, Logistic Regression, Neural Network, Random Forest, Elastic Net, LASSO Regression, Linear Regression, Ridge Regression, and XGBoost.
After the model building is finished, you can click “Model presentation” in “Build model” window to view the algorithm(s) used to build the model, as shown below:
Model performance
Model performance can be reflected through a series of indexes and figures.
Click “Model performance” in “Build model” window:
Models built on different types of target variables are evaluated by different parameters and forms.
Below is model performance information about binary target variable “Survived”:
Here’s model performance information about numerical target variable “Age”: