The IntegratedML feature of InterSystems IRIS returns predictions and probabilities using the AutoML technique. AutoML is a machine learning technology that selects the best machine learning algorithm/model for predicting statuses, numbers and other results from past data (the data used to train the AutoML model). You don't need a Data Scientist to get started, because AutoML tests the most common machine learning algorithms and selects the best one for you, based on the features of the data analysed. See more here, in this article.
InterSystems IRIS has a built-in AutoML engine, but it also lets you use H2O and DataRobot. In this article I will show you each step for using the InterSystems AutoML engine.
Step 1 - Download the Sample app to do the exercises
1. Go to https://openexchange.intersystems.com/package/Health-Dataset
2. Clone/git pull the repo into any local directory
$ git clone https://github.com/yurimarx/automl-heart.git
3. Open a terminal in the cloned directory and run:
$ docker-compose build
4. Run the IRIS container:
$ docker-compose up -d
Step 2 - Understand the Business Scenario and the data available
The business scenario is to predict heart disease from past data. The data available for this is:
SELECT age, bp, chestPainType, cholesterol, ekgResults, exerciseAngina, fbsOver120, heartDisease, maxHr, numberOfVesselsFluro, sex, slopeOfSt, stDepression, thallium FROM dc_data_health.HeartDisease
The data dictionary for the HeartDisease table is (source: https://data.world/informatics-edu/heart-disease-prediction/workspace/data-dictionary):
Column name | Type | Description |
age | Integer | In years |
sex | Integer | (1 = male; 0 = female) |
chestPainType | Integer | Value 1: typical angina -- Value 2: atypical angina -- Value 3: non-anginal pain -- Value 4: asymptomatic |
bp | Integer | Resting blood pressure (in mm Hg on admission to the hospital) |
cholesterol | Integer | Serum cholesterol in mg/dl |
fbsOver120 | Integer | (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) |
ekgResults | Integer | Resting electrocardiographic results -- Value 0: normal -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) -- Value 2: showing probable or definite left ventricular hypertrophy |
maxHr | Integer | Maximum heart rate achieved |
exerciseAngina | Integer | Exercise induced angina (1 = yes; 0 = no) |
stDepression | Double | ST depression induced by exercise relative to rest |
slopeOfSt | Integer | The slope of the peak exercise ST segment -- Value 1: upsloping -- Value 2: flat -- Value 3: downsloping |
numberOfVesselsFluro | Integer | Number of major vessels (0-3) colored by fluoroscopy |
thallium | Integer | 3 = normal; 6 = fixed defect; 7 = reversible defect |
heartDisease | String | Value 0: < 50% diameter narrowing -- Value 1: > 50% diameter narrowing |
heartDisease is the property that we need to predict.
Step 3 - Prepare the train Data
The HeartDisease table has 270 rows. We will use 250 of them to train our prediction model. To do this, we will create the following view inside Management Portal > System Explorer > SQL:
CREATE VIEW automl.HeartDiseaseTrainData AS SELECT * FROM dc_data_health.HeartDisease WHERE ID < 251
Step 4 - Prepare the validation Data
We will use the remaining 20 rows to validate the results of the prediction. To do this, we will create the following view inside Management Portal > System Explorer > SQL:
CREATE VIEW automl.HeartDiseaseTestData AS SELECT * FROM dc_data_health.HeartDisease WHERE ID > 250
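The two views split the table purely by ID, so it can be worth checking that each part still contains more than one distinct heartDisease value before training. The following is a minimal Python sketch of that idea, not something IntegratedML requires. The rows are synthetic stand-ins (made-up alternating labels), and split_by_id and label_counts are hypothetical helper names used only for illustration.

```python
from collections import Counter


def split_by_id(rows, last_train_id=250):
    """Split rows the same way the two views do: ID < 251 and ID > 250."""
    train = [r for r in rows if r["ID"] < last_train_id + 1]
    test = [r for r in rows if r["ID"] > last_train_id]
    return train, test


def label_counts(rows, label="heartDisease"):
    """Count how often each label value occurs in the given rows."""
    return Counter(r[label] for r in rows)


# Synthetic stand-in for the 270-row table; these labels are NOT real data.
rows = [{"ID": i, "heartDisease": str(i % 2)} for i in range(1, 271)]

train, test = split_by_id(rows)
print(len(train), len(test))          # sizes of the two parts
print(label_counts(train))            # label distribution in the training part
print(len(label_counts(train)) > 1)   # True when there is something to learn
```

If the training part contained only one distinct label value, a classifier would have nothing to distinguish, so a check like this is a cheap way to catch a badly loaded table early.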
Step 5 - Create the AutoML model to predict Heart Disease
IntegratedML lets you create an AutoML model that returns predictions and probabilities (see more at https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GIML_BASICS). To do this, we will create the following model inside Management Portal > System Explorer > SQL:
CREATE MODEL HeartDiseaseModel PREDICTING (heartDisease) FROM automl.HeartDiseaseTrainData
The model will learn from the training data in the automl.HeartDiseaseTrainData view.
Step 6 - Execute the Training
To run the training, we will execute the following SQL instruction inside Management Portal > System Explorer > SQL:
TRAIN MODEL HeartDiseaseModel
Step 7 - Validate the model trained
To validate the trained model, we will execute the following SQL instruction inside Management Portal > System Explorer > SQL:
VALIDATE MODEL HeartDiseaseModel FROM automl.HeartDiseaseTestData
This validates the HeartDiseaseModel against the testing data from the automl.HeartDiseaseTestData view.
Step 8 - Get the validation metrics
To see the validation metrics from the validation process, we will execute the following SQL instruction inside Management Portal > System Explorer > SQL:
SELECT * FROM INFORMATION_SCHEMA.ML_VALIDATION_METRICS
To understand the results returned, see https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GIML_VALIDATEMODEL.
The InterSystems IRIS documentation details the following about the validation results:
The output of VALIDATE MODEL is a set of validation metrics that is viewable in the INFORMATION_SCHEMA.ML_VALIDATION_METRICS table.
For regression models, the following metrics are saved:
- Variance
- R-squared
- Mean squared error
- Root mean squared error
For classification models, the following metrics are saved:
- Precision — This is calculated by dividing the number of true positives by the number of predicted positives (sum of true positives and false positives).
- Recall — This is calculated by dividing the number of true positives by the number of actual positives (sum of true positives and false negatives).
- F-Measure — This is calculated by the following expression: F = 2 * (precision * recall) / (precision + recall)
- Accuracy — This is calculated by dividing the number of true positives and true negatives by the total number of rows (sum of true positives, false positives, true negatives, and false negatives) across the entire test set.
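To make the four classification formulas concrete, here is a small Python sketch that computes them from confusion-matrix counts. It only illustrates the arithmetic described above and is not how IRIS computes the metrics internally. The function name classification_metrics, the example counts, and the choice to return 0.0 when a denominator is zero are all assumptions made for this illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F-measure and accuracy from raw counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Returning 0.0 on a zero denominator is an assumption
    of this sketch, not documented IRIS behavior.
    """
    predicted_positives = tp + fp
    actual_positives = tp + fn
    total = tp + fp + tn + fn

    precision = tp / predicted_positives if predicted_positives else 0.0
    recall = tp / actual_positives if actual_positives else 0.0
    pr_sum = precision + recall
    f_measure = 2 * (precision * recall) / pr_sum if pr_sum else 0.0
    accuracy = (tp + tn) / total if total else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Made-up counts for a 20-row test set (not results from the real model).
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, precision is 8/10, recall is 8/9, and accuracy is 17/20, which shows how a model can score differently on each metric for the same test set.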
Step 9 - Execute the predictions using your new AutoML model - the last step!
To get predictions from the trained model, we will execute the following SQL instruction inside Management Portal > System Explorer > SQL:
SELECT *, PREDICT(HeartDiseaseModel) AS heartDiseasePrediction FROM automl.HeartDiseaseTestData
Compare the columns heartDisease (the real value) and heartDiseasePrediction (the predicted value).
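If you export the two columns, the comparison can also be done programmatically. The Python sketch below tallies how many predictions match the real values. It assumes the labels are the strings "0" and "1" as listed in the data dictionary, the pairs are invented for illustration, and confusion_counts is a hypothetical helper name.

```python
def confusion_counts(pairs, positive="1"):
    """Tally (tp, fp, tn, fn) from (actual, predicted) label pairs."""
    tp = fp = tn = fn = 0
    for actual, predicted in pairs:
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted disease, disease present
            else:
                fp += 1   # predicted disease, disease absent
        else:
            if actual == positive:
                fn += 1   # missed a real case
            else:
                tn += 1   # correctly predicted no disease
    return tp, fp, tn, fn


# Invented (heartDisease, heartDiseasePrediction) pairs, NOT real query output.
pairs = [("1", "1"), ("0", "0"), ("1", "0"), ("0", "1"), ("0", "0")]

tp, fp, tn, fn = confusion_counts(pairs)
matches = tp + tn
print(f"{matches} of {len(pairs)} predictions match the real value")
```

Separating the mismatches into false positives and false negatives is useful here because, in a medical scenario, a missed case and a false alarm usually do not carry the same cost.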
Enjoy!
Yuri,
Thanks for releasing this app. I've hit a couple of snags that you might be able to help with.
1. The table you reference for creating the training and test data views is SQLUser.HeartDisease. I don't see this table in the Management Portal, but perhaps you meant to use the dc_data_health.HeartDisease table to create the training and testing views?
2. Using the dc_data_health.HeartDisease table works as expected for creating the training and test data, and creating a model based on the training data view appears to work as expected. However, when I execute the 'TRAIN MODEL HeartDiseaseModel' query, I get this error:
[SQLCODE: <-185>:<Predicting Column only has one unique value in the dataset>]
[%msg: < Label column only has one unique value in the dataset.>]
Any thoughts on what the issue might be?
Thanks again - Don Martin
@Don Martin
Can you try this similar app? App: https://openexchange.intersystems.com/package/Predict-Maternal-Risk
It is from the same origin and I need to know if you get the same error.
@Yuri Marx
The Predict-Maternal-Risk app linked above worked great! No errors, and I was able to get all the way through the entire set of ML queries to build, train, validate, and predict risk using the training and validation data.