The section covers data source type, variable type detection and export data quality report.
Click “New model” in “File” drop-down menu or click on the toolbar to get “Load data” window. There are three types of data sources: “Local data file”, “Database type” and “Remote server”.
Check “Local data file”, select a folder under “Look In” and all available files for model building will be listed. File types supported by YiMing AI Model include *.mcf, *.mtx , *.txt and *.csv.
Load mtx file
Since “File options” have already been configured when generating the modeling table, you can select the mtx file and click “OK” to directly input data.
Load non-mtx data file
To load a non-mtx data file, you need to perform data loading configurations and select to-be-loaded variables, take train.csv for example:
Select the file and click “OK” to get “Load data” window:
The configurations are the same as those explained in File options.
With all these options, you click “Next” to move on to “Import data dictionary” window, as shown below:
Select variables you need on the left side of the window. By default all variables are checked.
“Select”: Check boxes to include desired variables; otherwise they will be excluded.
“Type”: You can modify data type for a variable; or click to select pre-defined data dictionary file to import variable types.
The data dictionary supports four types of files – xls, xlsx, txt and csv. Its format is shown below:
(1) The first column contains variable names.
(2) The second column contains variable types, which include Identity (ID), Categorical (for classes or divisions of people or things), Binary (a double-value), Unary (a single-value), Count (number of people or things), Numerical (a number), Datetime (date and time), and Text string (Long text data). You can set an eligible type for a working data set.
(3) The third column contains format for date data (yyyy-MM-dd), time data (HH:mm:ss), datetime format (yyyy/MM/dd HH:mm:ss).
(4) The fourth column records that whether the current variable type is in use. A true value represents that it is used and false unused.
(5) The fifth column records the degree of importance of the current variable.
Below is a data dictionary in CSV format:
Note: Variables in a CSV file are separated by commas, and those in a txt file are separated by tabs.
On the right side of the “Preview data” screen, users can set the number of lines displayed on the screen and then click “Reload” to refresh the data.
Now click “Finish” to get “Detect variable type” window:
Select desired options (detailed configurations will be explained in Detect variable type) and click “OK” to bring data in.
You need to perform the following configurations in order to connect to a database source:
1. Click “Data source” option under “Tools” to get the following window:
2. Click “New” button to get “Database type” box. There are two database types: JDBC and ODBC.
Check “JDBC” and click “OK” to get “Datasource” window where you can configure the database data source:
1. “Datasoursce name”: Any name you want, such as ora and SQL17.
2. “Database vendor”: Select a desired database in the drop-down menu; select “Others” if there isn’t the one you want.
3. “Driver”: You can enter one listed in the drop-down menu, or type in other driver information. The driver jars for connecting to a database need to be copied into【Designer installation directory】/common/jdbc.
4. “Datasource URL”: The URL of the database to be connected. Take Oracle as an example, it is jdbc:oracle:thin:@192.168.0.68:1521:orcl.
5. “User”: Database user name.
6. “Password”: The password for accessing the to-be-connected database.
7. “Qualify object with schema”: You can decide whether to add schema before a table name, such as dbo.employee by checking or unchecking this option.
8. “Enclose object name in quotes”: By checking or unchecking this option, you can decide whether to quote object names.
9. “Auto-commit”: You can control the database commit by checking or unchecking this option.
“Extended properties”: On this tab, you can configure more parameters for a database. Different databases have different parameters, which are automatically added according to your database after the “General properties” configurations. Some databases, such as Access, don’t require extended parameters.
After all these configurations, click “OK” to return back to “Datasource” window, where the newly-created data source is listed.
Click “Connect”, and if the data source name turns pink, the data source connection is established.
Check “ODBC” and click “OK” to get “ODBC datasource” window:
1. “Datasoursce name”: Any name you want, such as ora and SQL17.
2. “ODBC name”: The ODBC data source name configured in ODBC datasource manager.
3. “Username”: Database user name.
4. “Password”: The password for accessing the to-be-connected database.
5. “Qualify object with schema”: You can decide whether to add schema before a table name, such as dbo.employee by checking or unchecking this option.
6. “Case sensitive”: You can specify whether a SQL query is the same with both lowercase and uppercase.
7. “Enclose object name in quotes”: By checking or unchecking this option, you can decide whether to quote object names.
After all these configurations, click “OK” to return back to “ODBC datasource” window, where the newly-created data source is listed.
You can specify whether to encrypt the password defined in “Datasource” window by selecting one option under “Encryption level” drop-down menu. This corresponds to the encryptLevel property in userconfig.xml file whose value is 0 (plaintext) or 1(Encrypt password).
“Plaintext”: When this is selected, the value of encryptLevel in userconfig.xml file will be automatically set as 0. In this case, the password won’t be encrypted, as shown below:
“Encrypt password”: When this is selected, the value of encryptLevel in userconfig.xml file will be automatically set as 1. In this case, the password will be shown in an encrypted text string, as shown below:
Check “Database type” to configure information including data source, schema, table, fields, etc. For detailed explanation, see Remote data source. Then click “OK” to bring data in.
You can also build a predictive model in a remote server.
Check “Remote server”, as shown below:
Select a server and you’ll get a list of all modeling table files under data file directory. Select a desired file and click “OK” to bring data in.
The system will ask you to detect variable type when you load a non-mtx file or a database table to build a new model.
To make sure that you input variables with correct types, you can check “Detect variable types now”. For a large file, you can check “Do not detect (Can be detected later via "Detect variable data type" option)” to perform detection in “Detect variable data type” menu.
By checking “Do not detect (Can be detected later via "Detect variable data type" option)”, data will be imported with default variable type “Categorical variable”. Later you need to perform detection in “Detect variable data type” menu on the main screen, or set the type for each variable manually. You’ll get an error report during model building process if you skip the detection or set wrong variable types.
If you check an option, the choice will be observed the next time you open the “Load data” window. You can change the configuration in Tools ->Configuration, as shown in the following example:
After modeling data is brought in, you can click “Export data quality report” in “File” menu to generate a report about your modeling data quality. The exported report is a pdf file.
The data quality report includes the number of variables, number of lines, missing value rate, numerical variables deviation, association strength between categorical variables, and other information.