
I am working on a project that involves fitting a random forest model to a housing dataset to predict resale housing prices. Based on the variable importance plot, I made the following interpretations. (Refer to this image for the plot: VarImptPlot)

  • Based on %IncMSE, town and remaining lease are important predictors: randomly shuffling their values would result in at least a 50% increase in MSE.

My concern is that town is a categorical variable with over 20 classes, and not all classes are shown in the plot. Is it still possible to conclude that town is a significant predictor, given that many of its levels are ranked high up in the plot?

  • Based on IncNodePurity, floor area and remaining lease are identified as important variables whose splits result in a significant decrease in node impurity.

  • Hence town, remaining lease, and floor area (sqm) are significant predictors that influence the resale price of HDB flats.

Is it correct to interpret the significance of predictors like this?

1 Answer

Yes, it is fair to say that town is a significant predictor. Each town is used as a split node that asks, for example, "Is the town Bukit Merah?", splitting the node into two child nodes, one for "Yes" and one for "No". Your model shows a high purity gain for this particular town, which is an indicator of a good predictor variable.

The purity value represents how homogeneous the samples are after the split. You get one group of data in which the town is Bukit Merah and another in which it is not. Within each group, compare the resale prices of the data points: the more similar the values, the higher the purity. As you can see, some towns create more homogeneous groupings than others, but overall one can certainly demonstrate that town is a significant predictor of resale housing price in your model.
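To make the idea of purity concrete, here is a minimal sketch in plain Python. It assumes a regression setting where node impurity is the residual sum of squares around the node mean, and it scores a yes/no split on a single town. The prices and the names "Town A" and "Town B" are hypothetical toy values, not taken from the question's dataset, and the helper functions are illustrative rather than any library's API.

```python
# Hypothetical toy data: (town, resale price in thousands).
# These values are made up for illustration only.
sales = [
    ("Bukit Merah", 700), ("Bukit Merah", 720), ("Bukit Merah", 710),
    ("Town A", 400), ("Town A", 420),
    ("Town B", 450), ("Town B", 430),
]


def sum_of_squares(prices):
    """Residual sum of squares around the mean (impurity of a regression node)."""
    if not prices:
        return 0.0
    mean = sum(prices) / len(prices)
    return sum((p - mean) ** 2 for p in prices)


def purity_gain(rows, town):
    """Decrease in sum of squares produced by the split 'Is the town <town>?'."""
    parent = [price for _, price in rows]
    yes = [price for t, price in rows if t == town]
    no = [price for t, price in rows if t != town]
    return sum_of_squares(parent) - (sum_of_squares(yes) + sum_of_squares(no))


gain_bm = purity_gain(sales, "Bukit Merah")
gain_a = purity_gain(sales, "Town A")
print(f"Gain from splitting on Bukit Merah: {gain_bm:.1f}")
print(f"Gain from splitting on Town A:      {gain_a:.1f}")
```

In this toy data the Bukit Merah split leaves two groups with very similar prices inside each, so it yields the larger gain. A forest accumulates gains of this kind over every split that uses a variable, which is roughly what a node-purity importance score summarises.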

xiA
  • Can you please point to some useful (and free) links on this topic? – Olivier Roche Dec 04 '19 at 10:43
  • Gini impurity is simply explained at the following link: https://victorzhou.com/blog/gini-impurity/ – xiA Dec 05 '19 at 11:20
  • Also, imagine that you have a dataset of apples and oranges: each row has a fruit id, a Shape column and a Colour column. We also know the true fruit type, i.e. the label, orange or apple. Which predictor variable has the greater purity? In other words, which predictor variable separates the fruit better? Splitting the dataset on Shape (i.e. is Shape round?) will result in a group containing both apples and oranges, which is not very useful. But splitting the dataset on Colour (i.e. is Colour orange?) will split the fruit dataset into the correct separate groups of apples and oranges. – xiA Dec 05 '19 at 11:31
  • Also https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76 and https://en.wikipedia.org/wiki/Decision_tree_learning – xiA Dec 05 '19 at 11:32
  • Thanks a lot for the useful links. :) – Olivier Roche Dec 06 '19 at 07:16
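
The apples-and-oranges comparison from the comments can be checked numerically. The sketch below computes the weighted Gini impurity after splitting on Shape versus Colour. The six rows are hypothetical toy data invented for illustration, and `gini` and `weighted_gini_after_split` are illustrative helpers, not functions from any library.

```python
from collections import Counter

# Hypothetical toy rows: (shape, colour, label). All fruit are round,
# so Shape cannot tell the two labels apart, while Colour can.
fruits = [
    ("round", "orange", "orange"),
    ("round", "orange", "orange"),
    ("round", "orange", "orange"),
    ("round", "red", "apple"),
    ("round", "green", "apple"),
    ("round", "red", "apple"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini_after_split(rows, column, value):
    """Size-weighted Gini impurity of the two groups from 'column == value?'."""
    yes = [row[2] for row in rows if row[column] == value]
    no = [row[2] for row in rows if row[column] != value]
    n = len(rows)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)


shape_split = weighted_gini_after_split(fruits, 0, "round")
colour_split = weighted_gini_after_split(fruits, 1, "orange")
print("Impurity after 'is Shape round?':   ", shape_split)
print("Impurity after 'is Colour orange?': ", colour_split)
```

Lower impurity after the split means purer groups. Here the Colour split separates the labels completely, while the Shape split leaves the original mix untouched.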