For model building

Before building a model, we need to configure a set of parameters: the data source, the variables on which the model is built, model building options, the path where the model file is stored, whether to preprocess the data, and so on. On execution, these parameters are saved as an mcf file in the form of a JSON string. This section describes the parameters for model building.

Parameters

{

"modelType" // Source data type; default is 0, which represents a local file

"modelMetaData": { // Metadata

"fieldList": [ // List of variables

{

"varName"// Variable name

"dataType" //Variable type: 0 – Default type auto-check; 1 – Binary variable; 2 – Unary variable; 3 – Categorical variable; 11 – Numerical variable; 12 – Count variable whose value is an integer; 13 – Datetime variable; 20 – ID; 21 – Text string

//Note: Automatic type detection may produce errors

"isTarget" // A target variable or not; a Boolean value of true or false; only one true value is allowed.

"isActionable" // Actionable (operable) or not; a Boolean value of true or false; default is true

"isSource" // Non-target variable or not; a Boolean value of true or false; default is true

"importance"// Variable importance degree (0-1) returned after model building is finished

"isComputeCol"// Computed variable (column) or not; a Boolean value of true or false

"isTypeDefined" // Data type is defined or not; a Boolean value of true or false

"isSourceDefined" // Variable selection is done or not; a Boolean value of true or false

},

  //Next variable

  //

]

"modelFile" // Model file (.pcf file) path

"dsType" // Loaded file type – 0, txt and csv

"dsConfig": { // File loading configuration

"srcFilePath" //Source file path

"hasTitle" // Load headers or not

"fieldNames"//Field name

"fieldTypes" //Field type

"useDisp" // Display line count or not

"dispNumber" //Display line count

"charset" // Character set configuration

"separator" // Separator

"isEscape"// Remove all quotation marks

"isTransQuota" // Use double quotation marks as escape character or not

"checkValid" // Check for lines whose column count does not match the column count of line 1

"skipErrorRow" // Skip ineligible lines

"useTop" // Import only the top N lines or not

"topNumber" // Number of top lines to import (N)

"useBlock"// Import block by block or not

"blockIndex" // Block number

"blockCount" //Block count

"dateFormat" // Date format

"timeFormat" // Time format

"dateTimeFormat" // Datetime format

"missingFormat" // Missing value definition

"language" // Language for locale

"country" // Country for locale

"variant" // Variant for locale

"formats" // List of datetime formats

},

"needPrepare" // Preprocess source data or not

"parallelNumber" // Parallel tasks count for preprocessing

"isIntelligenceImpute" // Intelligent impute or not

"isResample" //Resample or not

"balanceParams" // Balanced sampling ratio for the target variable (int type); value range is 1-9; [1] means the ratio of majority samples to minority samples is 1:1

"advanceSelect" // Use advanced variable selection configuration or not; output all data for preprocessing when selected

"optimalParam" // Search optimal parameter or not

"resampleMultiple" //Sample multiplier

"testDataPercent" // Test data percentage is 1%-99%

"ensembleMethod" // “Best top N” and “Simple”; the former selects best N models to build a new model and involves a comparatively large computation amount; the latter just combines all defined models and involves a comparatively small computation amount

"resampleBestN" // If isResample=true, resample and select multiple models, which is equivalent to Best N; recommended default N is 3

"resampleNumber" // Sample count

"ensembleFunc" // Ensemble function

"ensembleBestN" // Best N for model building ensemble

"dataBalances" // Target variable data ratio of majority to minority (float type)

"chunkSize" // Scoring result set chunk count

"adjustProb" // Adjust scoring result or not

"fixedSeed" // Fixed seeds or not

"randomState" // Random seed that controls randomness in model building; default is 0; 0 – two executions produce different (random) model objects; n – using the same integer n in two executions produces the same model objects, while different integers produce different model objects

"classModels": // List of classification model parameters

[

"{\"modelName\"//model name,\"count\"//sample count}",

]

"regressionModels": // List of regression model parameters

[

"{\"modelName\"// model name,\"count\"// sample count }",

]

"isEscape" // Remove all quotation marks at data loading

"classCategoryCount"// Category count

"numberCategoryCount"// Segment count

"groupMaxCount"// Max record count in a group

"groupMinCount"// Min record count in a group

"accuracySetting": { // Threshold value display configuration

"accuracyMin" // Min threshold value

"accuracyMax" // Max threshold value

"accuracyCount" // Segment count

}

},

"srcFilePath" //mtx file path

"rowCount" // Number of rows

"colCount" // Number of columns

}
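The fieldList entries described above can also be assembled programmatically. The following is a minimal Python sketch, not part of the product: the names DataType, make_field and find_target are hypothetical helpers, the numeric codes come from the dataType description above, and the default values for importance, isComputeCol, isTypeDefined and isSourceDefined are assumptions taken from the example later in this section.

```python
from enum import IntEnum


class DataType(IntEnum):
    """Numeric codes listed for the "dataType" parameter."""
    AUTO = 0          # default: type is detected automatically
    BINARY = 1
    UNARY = 2
    CATEGORICAL = 3
    NUMERICAL = 11
    COUNT = 12        # integer-valued count variable
    DATETIME = 13
    ID = 20
    TEXT = 21


def make_field(var_name, data_type=DataType.AUTO, is_target=False):
    """Build one fieldList entry (hypothetical helper, assumed defaults)."""
    return {
        "varName": var_name,
        "dataType": int(data_type),
        "isTarget": is_target,
        "isActionable": True,      # documented default
        "isSource": True,          # documented default
        "importance": 0,           # filled in after model building
        "isComputeCol": False,
        "isTypeDefined": False,
        "isSourceDefined": False,
    }


def find_target(field_list):
    """Return the target variable name; reject more than one target."""
    targets = [f["varName"] for f in field_list if f["isTarget"]]
    if len(targets) > 1:
        raise ValueError(f"only one target variable is allowed, got {targets}")
    return targets[0] if targets else None


fields = [
    make_field("PassengerId", DataType.ID),
    make_field("Survived", DataType.BINARY, is_target=True),
]
target = find_target(fields)
```

Using an IntEnum keeps the numeric codes readable in code while still producing plain integers in the resulting dictionaries.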

 

Example

We’ll illustrate how to use a JSON file to build a model using the following data.

Titanic passenger data:

12 fields (variables) for model building: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

The target variable is Survived. It is a binary variable that indicates whether a passenger survived (1 means survived; 0 means dead).
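The dataBalances parameter is described as the target variable's ratio of majority to minority. For a binary target such as Survived, that kind of ratio can be illustrated with the small Python sketch below. The function name majority_minority_ratio and the label list are made up for illustration, and how the product itself computes the value is not stated here.

```python
from collections import Counter


def majority_minority_ratio(labels):
    """Ratio of the larger class count to the smaller one for a binary target."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes in a binary target")
    majority, minority = sorted(counts.values(), reverse=True)
    return majority / minority


# Hypothetical labels: three passengers with 0 (dead), two with 1 (survived)
labels = [0, 0, 0, 1, 1]
ratio = majority_minority_ratio(labels)  # 3 / 2
```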

 

The JSON file for model building is as follows:

{

"modelType": 0,   

"modelMetaData":{

"fieldList": [

{

"varName": "PassengerId",

"dataType":20, // PassengerId is an ID type variable

"isTarget":false,

"isActionable": true,

"isSource": true,

"importance": 0 ,

"isComputeCol": false,

"isTypeDefined": false,

"isSourceDefined": false 

},{

"varName": "Survived",

"dataType": 1, //Survived is a binary variable 

"isTarget": true, //Survived is the target variable

"isActionable": true,

"isSource": true,

"importance": 0,

"isComputeCol": false,

"isTypeDefined": false,

"isSourceDefined": false

}, {

"varName": "Pclass",

"dataType": 3, // Pclass is a categorical variable

"isTarget": false,

"isActionable": true,

"isSource": true,

"importance": 0 ,

"isComputeCol": false,

"isTypeDefined": false,

"isSourceDefined": false

}, {

"varName": "Age",

"dataType": 11, // Age is a numerical variable

"isTarget": false,

"isActionable": true,

"isSource": true,

"importance": 0,

"isComputeCol": false,

"isTypeDefined": false,

"isSourceDefined": false

},

  //Next variable

  //

],

"modelFile": "C:\\Program Files\\raqsoft\\ymodel\\documents\\csv\\train.pcf",

"dsType": 0,

"dsConfig": {

"srcFilePath": "C:\\Program Files\\raqsoft\\ymodel\\documents\\csv\\train.csv",

"hasTitle": true,

"fieldNames": ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"],

"fieldTypes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],

"useDisp": false,

"dispNumber": 0,

"charset": "GBK",

"separator": ",",

"isEscape": true,

"isTransQuota": false,

"checkValid": true,

"skipErrorRow": false,

"useTop": false,

"topNumber": 10000,

"useBlock": false,

"blockIndex": 1,

"blockCount": 1,

"dateFormat": "yyyy/MM/dd",

"timeFormat": "HH:mm:ss",

"dateTimeFormat": "yyyy/MM/dd HH:mm:ss", 

"missingFormat": "NULL|N/A",

"language": "zh",

"country": "CN",

"variant": "",

"formats": ["", "", "", "", "", "", "", "", "", "", "", ""]

},

"needPrepare": true,

"parallelNumber": 1,

"isIntelligenceImpute": true,

"isResample": true,

"balanceParams": [1],

"advanceSelect": false,

"optimalParam": false,

"resampleMultiple": 150,

"testDataPercent": 0,

"ensembleMethod": "best_n",

"resampleBestN": 3,

"resampleNumber": 5,

"ensembleFunc": "np.mean",

"ensembleBestN": 0,

"dataBalances": [1.5325202941894531],

"chunkSize": 1000000,

"adjustProb": true,

"fixedSeed": true,

"randomState": 0,

"classModels":[

"{\"modelName\":\"TreeClassification\",\"count\":1}", // modelName is the model name; count is the sample count

"{\"modelName\":\"GBDTClassification\",\"count\":1}",

"{\"modelName\":\"RFClassification\",\"count\":1}",

"{\"modelName\":\"LogicClassification\",\"count\":1}",

"{\"modelName\":\"RidgeClassification\",\"count\":1}",

"{\"modelName\":\"FNNClassification\",\"count\":1}",

"{\"modelName\":\"XGBClassification\",\"count\":1}"

],

"regressionModels": [ // List of regression model parameters

"{\"modelName\":\"TreeRegression\",\"count\":1}",

"{\"modelName\":\"GBDTRegression\",\"count\":1}",

"{\"modelName\":\"RFRegression\",\"count\":1}",

"{\"modelName\":\"LRegression\",\"count\":1}",

"{\"modelName\":\"LassoRegression\",\"count\":1}",

"{\"modelName\":\"ENRegression\",\"count\":1}",

"{\"modelName\":\"RidgeRegression\",\"count\":1}",

"{\"modelName\":\"FNNRegression\",\"count\":1}",

"{\"modelName\":\"XGBRegression\",\"count\":1}"

],

"isEscape": true,

"classCategoryCount": 15,

"numberCategoryCount": 24,

"groupMaxCount": 3000000,

"groupMinCount": 1000000,

"accuracySetting": {

"accuracyMin": 0.05,

"accuracyMax": 0.95,

"accuracyCount": 20

}

},

"srcFilePath": "train3.mtx", // mtx file path

"rowCount": 623,

"colCount": 12

}
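The listing above carries // comments for explanation, which a strict JSON parser rejects. Each entry of classModels and regressionModels is also itself a JSON string that needs a second round of parsing. The Python sketch below shows one way to handle both points when inspecting such a file. The helper names strip_line_comments and parse_model_entries and the shortened sample text are assumptions for illustration, not part of the product.

```python
import json


def strip_line_comments(text):
    """Remove // comments that appear outside of JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                # Copy the escaped character so \" does not end the string
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def parse_model_entries(entries):
    """Each entry is a JSON string such as {"modelName": ..., "count": ...}."""
    return [json.loads(entry) for entry in entries]


# Shortened, hypothetical sample in the same shape as the example above
sample = r'''
{
  "modelType": 0, // local file
  "modelMetaData": {
    "classModels": [
      "{\"modelName\":\"TreeClassification\",\"count\":1}",
      "{\"modelName\":\"GBDTClassification\",\"count\":1}"
    ]
  }
}
'''

config = json.loads(strip_line_comments(sample))
models = parse_model_entries(config["modelMetaData"]["classModels"])
model_names = [m["modelName"] for m in models]
```

The comment stripper tracks whether it is inside a string literal, so sequences such as // inside a path or value are left untouched. It does not repair other deviations from strict JSON, such as trailing commas.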