After the modeling data is loaded, you need to configure the variables. This section uses *train.mtx* as an example to illustrate the configuration.

A “Search variable” box sits at the bottom of the main screen, next to the number of loaded rows and variables. To quickly locate a variable, type its name in the box.

In many cases you can’t build a model directly from the original variables. You first need to derive computed variables from them and use those in model building.

Click “Edit -> Add computed variable” to configure a computed variable’s name and expression and click “OK” to add it. You can reference an existing function in the expression. Click a function to check its description.

For example, to add a computed variable *Sex_b* whose value is the first letter of each value of the variable *Sex*, the expression should be like this:

A computed variable is handled in the same way as an ordinary variable.
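The logic behind the *Sex_b* example can be sketched outside the tool. The Python snippet below is only an illustration and is not the tool's expression syntax; the sample values and the helper name `first_letter` are assumptions made for this sketch.

```python
# Hypothetical sample values for the Sex variable (not taken from train.mtx).
sex = ["male", "female", "female", None, "male"]


def first_letter(value):
    """Return the first character of a string, or None for a missing or empty value."""
    if value is None or value == "":
        return None
    return value[0]


# The computed variable holds one derived value per record.
sex_b = [first_letter(v) for v in sex]
print(sex_b)  # ['m', 'f', 'f', None, 'm']
```

Passing missing values through unchanged is a choice made for this sketch; the tool may treat them differently.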

The “Normal” tab provides several variable expression types: “Ratio”, “Time interval”, “Date time combination”, “Interaction”, “Transformation” and “Binning”. Users can quickly define a computed variable by following the directions.

**(1) Ratio**: ratio = *x*_{1} / *x*_{2}

Variables *x*_{1} and *x*_{2} must be of numerical or count type. The result is a numerical variable.

**(2) Time interval**: interval = *x*_{1} - *x*_{2}

Variables *x*_{1} and *x*_{2} must be of time, date, or user-defined time and date type. The result is a numerical variable.

The unit of the time interval can be millisecond, second, minute, hour, day, week, month, quarter or year.
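As a rough illustration of how the chosen unit changes the resulting number, here is a minimal Python sketch. The function names are hypothetical. How the tool counts calendar-based units (month, quarter, year) is not stated here, so `month_interval` simply counts calendar months and ignores the day of month, which is an assumption.

```python
from datetime import datetime, timedelta

# Units with a fixed length; month, quarter and year depend on the calendar.
UNIT_LENGTH = {
    "millisecond": timedelta(milliseconds=1),
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}


def time_interval(x1, x2, unit="day"):
    """Return x1 - x2 expressed as a number in the given fixed-length unit."""
    return (x1 - x2) / UNIT_LENGTH[unit]


def month_interval(x1, x2):
    """Calendar months from x2 to x1, ignoring the day of month (an assumption)."""
    return (x1.year - x2.year) * 12 + (x1.month - x2.month)


x1 = datetime(2024, 3, 15, 12, 0, 0)
x2 = datetime(2024, 1, 1, 0, 0, 0)
print(time_interval(x1, x2, "day"))   # 74.5
print(time_interval(x1, x2, "hour"))  # 1788.0
print(month_interval(x1, x2))         # 2
```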

**(3) Date time combination**: Generates a date variable, a time variable or a datetime variable.


“Combination type” can be:

Year, Month, Day (A date variable)

Hour, Minute, Second (A time variable)

Date, Time (A datetime variable)

“Format”: Set data format of a selected field.

**Note:**

(1) For the “Year, Month, Day” combination, if the month data is in “MMM” format, you need to set “Locale” to “English” when importing the data.

(2) For the “Date, Time” combination, if the date format is left as the default, the format of the default date value must be consistent with the configured value format. The same applies to time data. For example, if the date format is “yyyyMMdd”, the default value must use that same format.
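The three combination types and the role of the format setting can be sketched in Python. This is only an analogy: Python's `strptime` uses `%Y%m%d` where the tool uses “yyyyMMdd”, and `%b` where the tool uses “MMM”. The function names are hypothetical. As in note (1), parsing abbreviated month names with `%b` depends on the active locale.

```python
from datetime import date, datetime, time


def combine_ymd(year, month, day, month_format="%m"):
    """Build a date from separate year, month and day fields."""
    text = f"{year}-{month}-{day}"
    return datetime.strptime(text, f"%Y-{month_format}-%d").date()


def combine_hms(hour, minute, second):
    """Build a time from separate hour, minute and second fields."""
    return time(int(hour), int(minute), int(second))


def combine_date_time(date_text, time_text,
                      date_format="%Y%m%d", time_format="%H:%M:%S"):
    """Build a datetime from a date field and a time field.

    Each text value must match its configured format, otherwise
    strptime raises ValueError.
    """
    d = datetime.strptime(date_text, date_format).date()
    t = datetime.strptime(time_text, time_format).time()
    return datetime.combine(d, t)


print(combine_ymd(2024, "Mar", 5, month_format="%b"))  # 2024-03-05
print(combine_hms(13, 45, 10))                         # 13:45:10
print(combine_date_time("20240305", "13:45:10"))       # 2024-03-05 13:45:10
```

A value such as `"2024-03-05"` passed with the `%Y%m%d` format raises `ValueError`, which mirrors the requirement that values be consistent with the configured format.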

**(4) Interaction**: You can perform operations on two numerical variables or on two categorical variables, but not on one numerical variable and one categorical variable.

An operation over two numerical variables: interaction =

An operation over two categorical variables: interaction = (*x*_{1}, *x*_{2})

For example:

If *x1* = [0,1,1,1,0,0,1] and *x2* = [0,1,2,3,2,1,1], then the result is [(0,0), (1,1), (1,2), (1,3), (0,2), (0,1), (1,1)], which is displayed as a string.

Variables *x*_{1} and *x*_{2} must be of the same kind. If *x*_{1} is a numerical or count variable, *x*_{2} must be one as well. If *x*_{1} is a categorical variable, *x*_{2} must also be categorical.

Operations over two numerical variables generate a numerical variable; operations over two categorical variables generate a categorical variable.
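The categorical example above, with its *x1* and *x2* values, can be reproduced with a short Python sketch. The function name is hypothetical, and the exact string layout of each pair is an assumption based on the result shown in the example.

```python
def categorical_interaction(x1, x2):
    """Pair the values record by record and render each pair as a string."""
    if len(x1) != len(x2):
        raise ValueError("x1 and x2 must have the same number of records")
    return [f"({a},{b})" for a, b in zip(x1, x2)]


x1 = [0, 1, 1, 1, 0, 0, 1]
x2 = [0, 1, 2, 3, 2, 1, 1]
print(categorical_interaction(x1, x2))
# ['(0,0)', '(1,1)', '(1,2)', '(1,3)', '(0,2)', '(0,1)', '(1,1)']
```

Each distinct pair becomes one category of the new variable, so the result here has six categories: (1,1) appears twice.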

**(5) Transformation**: The available transformation functions are logarithm, tangent, arc tangent and hyperbolic tangent. For the logarithm, the base can be e, 2 or 10.

Tangent: tangent = tan(*x*)

Arc tangent: arc tangent = arctan(*x*)

Hyperbolic tangent: hyperbolic tangent = tanh(*x*)

Variable *x* must be a numerical or count variable, and the result is a numerical variable.
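For reference, the same transformations are available in Python's standard `math` module. The wrapper below is a hypothetical helper written for this sketch, not part of the tool.

```python
import math

# One function per supported logarithm base.
LOG_BY_BASE = {"e": math.log, "2": math.log2, "10": math.log10}


def transform(x, kind, base="e"):
    """Apply one of the listed transformations to a single numeric value."""
    if kind == "log":
        return LOG_BY_BASE[base](x)  # raises ValueError when x <= 0
    if kind == "tan":
        return math.tan(x)
    if kind == "arctan":
        return math.atan(x)
    if kind == "tanh":
        return math.tanh(x)
    raise ValueError(f"unknown transformation: {kind}")


print(transform(8, "log", base="2"))      # 3.0
print(transform(1000, "log", base="10"))  # 3.0
print(transform(0.0, "tanh"))             # 0.0
```

The logarithm is undefined for zero and negative values, so a variable containing such values needs to be shifted or filtered before a log transformation is applied.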

**(6) Binning**

Variable *x* must be a numerical or count variable. The result can be a binary, categorical or numerical variable, depending on the specific binning result.

The number of bins produced by “Equi-width binning” or “Equi-frequency binning” must be in the range 2-100.

“Custom”: In the box after “Enter a bin boundary”, enter one number at a time that lies between the bin low and the bin high listed below, then click “Add” to add it as a bin boundary.
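The difference between the three binning modes is easiest to see on a small example. The following Python sketch shows one common way to compute equi-width and equi-frequency boundaries. The tool's exact boundary rules (for instance, which bin a value equal to a boundary falls into) are not stated here, so those details, the function names and the sample values are all assumptions.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Inner boundaries splitting [min, max] into n_bins intervals of equal width."""
    if not 2 <= n_bins <= 100:
        raise ValueError("n_bins must be between 2 and 100")
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + width * i for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Inner boundaries chosen so each bin holds roughly the same number of records."""
    if not 2 <= n_bins <= 100:
        raise ValueError("n_bins must be between 2 and 100")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]


def assign_bins(values, edges):
    """Map each value to a bin index; a value equal to a boundary goes to the upper bin."""
    return [bisect_right(edges, v) for v in values]


ages = [2, 8, 15, 22, 30, 38, 45, 60]  # hypothetical sample values

print(assign_bins(ages, equal_width_edges(ages, 3)))      # [0, 0, 0, 1, 1, 1, 2, 2]
print(assign_bins(ages, equal_frequency_edges(ages, 4)))  # [0, 0, 1, 1, 2, 2, 3, 3]
print(assign_bins(ages, [18, 40]))                        # custom boundaries: [0, 0, 0, 1, 1, 1, 2, 2]
```

With two bins the result has only two distinct values, which is why a binned variable can end up as a binary variable.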

Analyzing a variable gives you information about it. You can analyze a single variable or all variables. To run the analysis:

Select a variable and click “Variable analysis”, or click “Analyze all variables”, in the drop-down menu under “Run”.

To analyze all variables:

The analysis continues until the message “Variable analysis is finished” appears.

There are 8 variable types for model building – numerical variable, unary variable, binary variable, count variable, categorical variable, ID, time and date, and text string.

Among them, ID variables and text string variables do not need to be analyzed. For a categorical variable, the tool calculates its missing value rate and association strength and shows them in a pie chart. The missing value rate is the percentage of records in which the variable is missing out of the total number of records. The association strength is the number of unique values in the variable.

For the categorical variable “*Embarked*”, there are 4 types of values: S, C, Q and missing values. The analysis result is shown below:

The results of analyzing a unary variable and a binary variable are similar. The association strength is 1 for a unary variable and 2 for a binary variable.
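The two measures defined above, missing value rate and association strength, can be sketched in Python as follows. Whether the tool counts the missing value as one of the unique values is not stated, so the `count_missing` switch is an assumption, as are the function names and the sample data.

```python
def missing_value_rate(values):
    """Share of records whose value is missing (None), between 0 and 1."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def unique_value_count(values, count_missing=False):
    """Number of distinct values, optionally counting 'missing' as one of them."""
    distinct = {v for v in values if v is not None}
    has_missing = any(v is None for v in values)
    return len(distinct) + (1 if count_missing and has_missing else 0)


# Hypothetical sample in the style of the Embarked variable.
embarked = ["S", "C", "Q", None, "S", "S", "C", None]

print(missing_value_rate(embarked))                      # 0.25
print(unique_value_count(embarked))                      # 3
print(unique_value_count(embarked, count_missing=True))  # 4
```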

Analyses of count variables and numerical variables are presented through a series of indexes, figures and graphs. There are 4 presentation types: “Descriptive statistics”, “Frequency distributions”, “Descriptive statistics of grouped target”, and “Frequency distributions of grouped target”. Below is an example (statistics of the numerical variable *Age*):

The same statistical presentation types are available for date variables, but the most commonly used are “Descriptive statistics” and “Descriptive statistics of grouped target”, as shown below:

There are two ways to filter variables: by “Importance” and by “Variable type”. Both use the option “Only filter selected variables”. If the option is checked, only the selected variables are filtered; if it is unchecked, all variables are filtered.

**Degree of importance**

The degrees of importance of the variables are shown after you execute model building. You can then filter the variables again according to the degrees of importance returned by the newly created model. The importance degree of an ID variable is 0, and that of the target variable is not analyzed. You can select the top N variables by importance, or select the variables whose importance degree is greater than a specified value. The “Importance” option is greyed out (inactive) until you execute model building.
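The two selection rules, top N and greater than a threshold, can be sketched in Python. The importance scores below are made-up numbers for illustration only, and *Id* is a hypothetical ID variable; real scores come from the model the tool builds.

```python
def top_n(importance, n):
    """Names of the n variables with the highest importance, highest first."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def above_threshold(importance, threshold):
    """Names of the variables whose importance is strictly greater than threshold."""
    return [name for name, score in importance.items() if score > threshold]


# Made-up scores for illustration; the ID variable has importance 0.
importance = {"Sex": 0.42, "Age": 0.21, "Embarked": 0.05, "Id": 0.0}

print(top_n(importance, 2))              # ['Sex', 'Age']
print(above_threshold(importance, 0.1))  # ['Sex', 'Age']
print(above_threshold(importance, 0.0))  # ['Sex', 'Age', 'Embarked']
```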

**Variable type**

To filter variables by type, use “Select by variable type”. Types that don’t exist in the loaded data are greyed out.


A target variable is the variable to be scored, that is, the field to be predicted. It must be a binary variable or a numerical variable. Select it from the loaded variables in the drop-down menu under “Target variable”.

Take the binary variable “*Survived*” as an example: