Does Isolation Forest really perform well in its task?

Rajesh Sharma · Published in DataDrivenInvestor · Mar 31, 2021

This blog is a continuation of the previous one, where we understood Isolation Forest (IF) thoroughly. Here we will work with its Sklearn implementation to see whether its fundamentals are actually competent at finding outliers.

Let’s start!!


Import some packages
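A minimal sketch of the imports that cover everything used in this post (the plotting libraries are my own assumption, since we visualise the results later):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest
```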

Loading the Dataset

Using Iris Dataset
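A sketch of loading Iris into a pandas dataframe (150 records, 4 features; the column and variable names are my own choice):

```python
# Load the Iris dataset into a dataframe
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target          # 0 = setosa, 1 = versicolor, 2 = virginica
X = df[iris.feature_names]           # features used for Isolation Forest
print(df.shape)                      # (150, 5)
```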

Instantiate IF

Instantiating IF and its parameters
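A sketch of the instantiation with the parameter values discussed in this post; the bootstrap, verbose and random_state values here are my own assumptions (each parameter is explained below):

```python
clf = IsolationForest(
    n_estimators=15,      # 15 trees (base estimators) in the forest
    max_samples=25,       # 25 random records used to build each tree
    contamination=0.05,   # we believe ~5% of the records are outliers
    max_features=2,       # 2 random features used to build each tree
    bootstrap=False,      # sampling without replacement (assumed; sklearn default)
    n_jobs=-1,            # use all CPU cores for training
    random_state=42,      # assumed seed, for reproducible results
    verbose=1,            # print some training logs (assumed level)
)
```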
  • n_estimators — This parameter represents the number of trees or base estimators that you want to build. For example, I have assigned 15 as the number of base estimators or trees in the forest.

The selection of this parameter should be done wisely: always pick a number that gives an even allocation of records and features across the trees.

  • max_samples — The number of observations you want to use for training every tree. For example, I have taken 25, which means 25 random records will be used to build an individual tree.

The idea behind this parameter is to ensure that every tree will get a different set of records.

  • contamination — The percentage of outliers you want to find, or believe to exist, in the dataset. For example, setting it to 0.05 tells IF that 5% of the records in the dataset are outliers.

This parameter should be selected based on the problem you are solving or the dataset you are using. The better you understand the dataset, the easier it is to select this value.

  • max_features — The maximum number of randomly selected features for training the tree. For example, I have taken 2, which means 2 random features will be used to build an individual tree.

We can also provide a floating-point value for the maximum features. For example, with max_features=0.2 the algorithm will use (0.2 * d) features, where d is the total number of features in the dataset. For the Iris dataset that is 0.2 * 4 = 0.8, which means only 1 feature.

Again, the idea behind this parameter is to ensure that every tree will get a different set of features.

  • bootstrap — Whether you want to sample records with replacement or without replacement. With bootstrap=True, one record can occur more than once in the 25 random records used to train a tree (this is known as sampling with replacement), whereas bootstrap=False ensures that each record appears at most once in the 25 random training records (this is known as sampling without replacement).

The general idea behind this parameter is to use sampling with replacement when you are dealing with a large dataset. The reason for not going the other way is that sampling without replacement can make things complex (with respect to the probability calculation of every observation) after a certain point. Sampling without replacement is helpful with smaller datasets.

  • random_state — While understanding the foundations of Isolation Forest, we discussed that it uses random axis-parallel splits, which makes it stochastic. By providing this value, we ensure reproducible results for the random selection of features and split (threshold) values.
  • verbose — This will print some training logs for you; see the training step below.
Training IF
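A sketch of the training call; with n_jobs=-1 and a non-zero verbose level, joblib prints progress logs while the trees are built in parallel:

```python
clf.fit(X)   # builds 15 trees, each on 25 random records and 2 random features
```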

Here, since I used n_jobs=-1, all the CPU cores are used for this training task, and some logs are displayed.

Base Estimator

Isolation Forest uses ExtraTreeRegressor as the base estimator.

Base Estimators
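We can confirm this by inspecting the fitted forest; a small sketch:

```python
# Each base estimator in the fitted forest is an ExtraTreeRegressor
print(type(clf.estimators_[0]).__name__)    # ExtraTreeRegressor
print(clf.estimators_[0].max_depth)         # 5 (explained below)
print(len(clf.estimators_))                 # 15 trees, as requested
```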

Now, looking at the base estimators above, you might be wondering: we didn’t provide the max_depth of a tree, so why is every tree initialized with max_depth=5?

The answer is that IF assigns the depth of a tree mathematically:

max_depth = ceil(log2(n))

Here, n is max_samples, i.e. the number of records used to train each tree. For example, I have max_samples=25, thus the depth comes out to be ceil(log2(25)) = 5.
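A quick sanity check of this formula in Python:

```python
import math

# ceil(log2(max_samples)) = ceil(log2(25)) = 5
print(math.ceil(math.log2(25)))   # 5
```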

Random Selection of Features

Features randomly selected for every base estimator
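The fitted model exposes the feature indices chosen for each tree; a sketch of inspecting them:

```python
# Feature indices randomly selected for each of the 15 trees (2 per tree)
for i, feats in enumerate(clf.estimators_features_):
    print(f"Tree {i}: feature indices {feats}")
```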

Random Selection of Observations

Indices of observations randomly selected for training every base estimator
Validating length of every training set
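Similarly, the record indices drawn for each tree are available, and we can validate that every training set contains exactly max_samples records:

```python
# Record indices randomly drawn for each tree
for i, idx in enumerate(clf.estimators_samples_):
    print(f"Tree {i}: {len(idx)} records, e.g. {idx[:5]}")

# Every tree should have been trained on exactly 25 records
assert all(len(idx) == 25 for idx in clf.estimators_samples_)
```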

How to interpret offset?

Once the algorithm is done with the training phase, it defines the decision function used for prediction.

Decision Function

decision_function gives the average anomaly score of an observation, based on its score from every tree. If it is less than 0, the observation is considered abnormal.

score_samples gives the opposite of the average anomaly score of an observation, based on its score from every tree. If it is less than the offset, the observation is considered abnormal.

offset_ is defined as follows:

  • When the contamination parameter is set to “auto”, the offset is equal to -0.5 as the scores of inliers are close to 0 and the scores of outliers are close to -1.
  • When the contamination parameter is set to something other than “auto”, the offset is defined in such a way that we obtain the expected number of outliers (samples with decision function < 0) in training.
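Under these definitions, decision_function is simply score_samples shifted by the offset; a small sketch that verifies this on the training data:

```python
scores = clf.score_samples(X)          # opposite of the average anomaly score
decisions = clf.decision_function(X)   # the same scores shifted by the offset
print("offset_:", clf.offset_)

# decision_function(X) == score_samples(X) - offset_
assert np.allclose(decisions, scores - clf.offset_)
```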
Generating Scores
Results
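A sketch of generating the per-record scores and labels (the column names are my own choice):

```python
df["score"] = clf.score_samples(X)              # compared against offset_
df["anomaly_score"] = clf.decision_function(X)  # negative => outlier
df["label"] = clf.predict(X)                    # -1 = outlier, 1 = inlier
print(df[["score", "anomaly_score", "label"]].head())
```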

Validating Offset

Let’s validate anomalies based on the criteria defined above in the decision function.

Anomalies validation based on Decision Function
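A sketch of that validation; both criteria (decision function below 0, score below the offset) should flag exactly the records predicted as outliers:

```python
outlier_mask = df["label"] == -1

# Criterion 1: decision_function < 0
assert ((df["anomaly_score"] < 0) == outlier_mask).all()
# Criterion 2: score_samples < offset_
assert ((df["score"] < clf.offset_) == outlier_mask).all()

print(outlier_mask.sum(), "records flagged as outliers")   # roughly 5% of 150
```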

Analysis

Let’s visualize the results.

Pair Plot
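A sketch of how such a pair plot can be produced with seaborn, colouring the points by the predicted label:

```python
# Map the numeric labels to readable names for plotting
df["label_name"] = df["label"].map({1: "inlier", -1: "outlier"})

# Pair plot of the four features, coloured by predicted label
sns.pairplot(df, vars=iris.feature_names, hue="label_name", palette="Set1")
plt.show()
```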

The first look does give the impression that points which are a bit far away (meaning fewer splits are required to isolate them) are labelled as outliers.

Some more plots

Variation among Inliers and Outliers
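One way to compare this variation is with box plots of each feature split by the predicted label; a sketch:

```python
# Box plots of each feature, split by inlier/outlier label
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, col in zip(axes, iris.feature_names):
    sns.boxplot(data=df, x="label_name", y=col, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```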

Clearly, every feature shows a good amount of variation between inliers and outliers.

Observation wise visualization

If we look at the outliers, they appear to be extreme points that lie at the edges or boundaries of the clusters.

Distributions of Inliers and Outliers
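A sketch of overlaying the feature distributions of inliers and outliers:

```python
# Overlaid feature distributions for inliers vs outliers
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, col in zip(axes, iris.feature_names):
    sns.kdeplot(data=df, x=col, hue="label_name", common_norm=False, fill=True, ax=ax)
plt.tight_layout()
plt.show()
```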

Some gaps are quite evident in the above plots, and the point to mention here is that a higher percentage of the outliers come from the extreme or higher values of the features.

Visualizing Outliers based on Target categories
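A sketch of one way to look at this, counting the predicted outliers per target category:

```python
# Number of predicted outliers per target category
df["species_name"] = df["species"].map(dict(enumerate(iris.target_names)))
outliers = df[df["label"] == -1]
print(outliers["species_name"].value_counts())
sns.countplot(data=outliers, x="species_name")
plt.show()
```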

The above plot gives us a better idea of the presence of outliers across the categories of the target.

Score-wise visualizing anomalies across target categories
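A sketch of plotting the anomaly scores per target category, with the predicted outliers highlighted:

```python
# Anomaly scores per species, with predicted outliers highlighted
sns.stripplot(data=df, x="species_name", y="anomaly_score", hue="label_name", dodge=True)
plt.axhline(0, linestyle="--", color="grey")   # decision_function boundary
plt.show()
```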

The anomalies in the Virginica class appear to have a higher anomaly score compared to the others.

In conclusion, we found that Isolation Forest works well at distinguishing outliers through its randomized approach to training records, features and branch split values.

Hurray, you have reached the end of this blog, and I hope you enjoyed it. Don’t forget to tap some claps.

Just give me the code

Here you go!!


It can be messy, it can be unstructured but it always speaks, we only need to understand its language!!