The LV cannot expand like a balloon: to allow for increased volume, the muscular wall has to thin. Once the muscular wall thins, the LV cannot pump efficiently, which causes a whole host of problems. The LV's function is also an indicator of overall cardiac function. If a model can predict the EDV accurately, it could help cardiologists determine who needs help the most & improve efficiency in the healthcare system.

Preprocessing

Before I was able to jump right into modeling, I had to process the data. Luckily for me, the data was very clean: only four columns had significant missing values. I spent a lot of time reading about data imputation because it is a very touchy subject: if you do it incorrectly, you can dramatically skew your data (a topic for a future blog post). I ended up using fancyimpute's KNN imputer because the missing data was discrete and ordinal. Apart from imputing data, I just had to make sure ordinal text was on a numeric scale.

My data has 48 features including the target variable, the EDV. 34 of those features represent either scarring or ischemia (reduced blood flow) in each of the 17 segments of the heart.

Modeling

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from math import sqrt
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.linear_model import RidgeCV, ElasticNetCV
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
```

My imports are fairly standard & I included three model types: linear, tree, & boosting models. Because of the way I engineered my features, I had to create two sets of data: one with the original features & a second with the engineered features. When I modeled, each model was run on both sets of data. In terms of model evaluation, I compared model performance on each set of features.

A random forest is a tree model: it uses decision trees to predict values. However, it is an improvement over the standard decision tree because it incorporates two levels of randomness: it bootstraps (random selection with replacement) rows & then chooses a random subset of features. I chose this model over other tree types because of the random choice of features: there are 48 features in the data set with the original features. The one downside to the way the random forest was set up is that I ran a GridSearch to try many combinations of hyperparameters (parameters I set), but doing so does not allow for easy extraction of feature importance & thus removes interpretation.

An XGBoost regression model, or Extreme Gradient Boosting, is a boosting model which fits an initial weak learner and then iteratively fits weak learners onto the residuals.
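The preprocessing steps above (putting ordinal text onto a numeric scale, then KNN-imputing the discrete missing values) can be sketched roughly as follows. The post used fancyimpute's KNN; this sketch swaps in scikit-learn's `KNNImputer`, which implements the same idea. The column names and ordinal categories here are made up for illustration — the real data set's columns are not shown in the post.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical ordinal column with a text scale (not the real column names).
df = pd.DataFrame({
    "ischemia_severity": ["none", "mild", "moderate", np.nan, "severe"],
    "segment_score": [0.0, 1.0, np.nan, 3.0, 4.0],
})

# Map ordinal text onto a numeric scale so models can use it.
scale = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}
df["ischemia_severity"] = df["ischemia_severity"].map(scale)

# KNN imputation: each missing value is filled in from the k most
# similar rows, which suits discrete, ordinal data better than a
# blanket mean or median fill.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Note that KNN imputation averages the neighbors' values, so an imputed ordinal entry can land between two levels; rounding it back to the nearest level is a common follow-up step.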
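The modeling loop described above — fitting each model on both the original and the engineered feature sets and comparing performance — might look something like this. The data here is synthetic and the engineered features (squared terms) are a hypothetical stand-in, since the post does not show its feature engineering.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for the post's two feature sets.
X_original = rng.normal(size=(200, 5))
y = X_original @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)
X_engineered = np.hstack([X_original, X_original ** 2])  # hypothetical engineered set

# Fit the same model on each feature set and score it on held-out data.
results = {}
for name, X in [("original", X_original), ("engineered", X_engineered)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
    model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "r2": r2_score(y_te, pred),
        "rmse": mean_squared_error(y_te, pred) ** 0.5,
    }
```

Comparing R² and RMSE side by side on the same held-out split keeps the comparison between feature sets fair.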
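The GridSearch over the random forest's hyperparameters can be sketched as below; the parameter grid is hypothetical, since the post does not list the values actually searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = X[:, 0] * 2 + X[:, 1] + rng.normal(scale=0.1, size=150)

# Hypothetical hyperparameter grid; GridSearchCV cross-validates
# every combination and refits the best one on the full data.
grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=3)
search.fit(X, y)

# The refit winner is a plain RandomForestRegressor, so its
# feature importances are still reachable after the search.
importances = search.best_estimator_.feature_importances_
```

One caveat to the post's downside: in current scikit-learn, `best_estimator_` exposes the refit model directly (via `named_steps` if it sits inside a `Pipeline`), so feature importances can usually be recovered even after a grid search.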
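The residual-fitting idea behind boosting can be seen in a minimal sketch. This uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost (the post's `xgb.XGBRegressor` exposes the same `fit`/`predict` interface), on synthetic data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Each stage fits a shallow tree to the current residuals, shrinks it
# by the learning rate, and adds it to the ensemble.
boost = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=2)
boost.fit(X, y)

# staged_predict yields the ensemble's prediction after each stage,
# showing the error shrink as residual-fitting stages accumulate.
errors = [np.mean((y - pred) ** 2) for pred in boost.staged_predict(X)]
```

With xgboost installed, `xgb.XGBRegressor(...).fit(X, y)` is the drop-in equivalent, with additional regularization hyperparameters.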