There's something you need to know if you are building decision tree models with Python's popular scikit-learn package.
The algorithm it uses for building the models is deterministic (it produces consistent results across multiple executions if the inputs don't change).
Despite this, scikit-learn's decision tree classes accept a random_state hyperparameter. Such a parameter only matters when an algorithm is not deterministic, because fixing it to a constant integer makes the random choices reproducible from one run to the next.
So, the random state must be a redundant parameter when building decision trees, right?
Not really.
The decision tree algorithm could use the value of the random state passed to it for making a 'decision' in the following three cases:
i) If you set the max_features hyperparameter to a value smaller than the total number of features. The algorithm then has to pick a random subset of features at each node and look for the best split only within that subset, which makes it a stochastic process (the opposite of deterministic).
ii) If you set the splitter hyperparameter to 'random' instead of 'best'. As the name suggests, the split threshold for each candidate feature is then drawn at random instead of being searched for exhaustively.
iii) If, at any node, more than one feature could produce the best split, that is, the same maximum improvement in the splitting criterion (such as gini or entropy). The tie has to be broken somehow, and scikit-learn examines the features in a randomly permuted order at every split, so the feature that wins can change from run to run.
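Here is a minimal sketch of cases i) and ii), assuming the iris dataset that ships with scikit-learn. The helper name split_features and the seeds 0 and 1 are only illustrative choices, not anything scikit-learn prescribes. It fits the same tree twice with the same seed and once with a different seed, then compares which feature was used at each node.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features


def split_features(seed, **tree_kwargs):
    """Fit a tree and return the feature index used at every node (-2 marks a leaf)."""
    model = DecisionTreeClassifier(random_state=seed, **tree_kwargs)
    model.fit(X, y)
    return model.tree_.feature.copy()


# Case i: only 2 of the 4 features are considered at each node.
# Case ii: split thresholds are drawn at random.
settings = {
    "max_features=2": {"max_features": 2},
    "splitter='random'": {"splitter": "random"},
}

for label, kwargs in settings.items():
    same_seed = np.array_equal(split_features(0, **kwargs), split_features(0, **kwargs))
    other_seed = np.array_equal(split_features(0, **kwargs), split_features(1, **kwargs))
    print(f"{label}: same seed identical = {same_seed}, "
          f"different seed identical = {other_seed}")
```

With the same seed the two trees should always match. With different seeds they will usually differ, though nothing guarantees that for every pair of seeds.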
Let us assume that you overlooked the importance of the random state parameter in the above three cases. To test whether increasing the depth of the decision tree would improve the evaluation metric being used, say, accuracy, you built two models with different depths and evaluated their accuracies.
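But because you did not fix the randomness while building the models, any difference in accuracy could come from the random choices made during training rather than from the change in depth, so you can't simply select the model with the highest accuracy as the best one.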
Hence, setting that random state to some constant integer beforehand could help you by making sure that you are comparing apples to apples and oranges to oranges.
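The depth comparison could look something like the sketch below. It assumes the breast cancer dataset bundled with scikit-learn, depths of 3 and 6, and max_features=5. All of these, along with the names SEED and accuracy_at_depth, are placeholders picked for illustration. The point is that the same seed goes into both the train/test split and every tree, so depth is the only thing that changes between the two models.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # any constant integer works; it only has to stay the same

X, y = load_breast_cancer(return_X_y=True)

# Fix the split as well, otherwise the two models see different test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)


def accuracy_at_depth(depth):
    """Train a tree of the given depth with a fixed seed and return its test accuracy."""
    model = DecisionTreeClassifier(
        max_depth=depth,
        max_features=5,      # fewer than the 30 features, so case i) applies
        random_state=SEED,
    )
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


results = {depth: accuracy_at_depth(depth) for depth in (3, 6)}
for depth, accuracy in results.items():
    print(f"max_depth={depth}: accuracy={accuracy:.3f}")
```

Rerunning the script reproduces the same two numbers, which is what makes the comparison fair. It still reflects a single seed, so repeating it over a few seeds would give a better picture.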
And by the way, do you know why the random state parameter is so often set to 42 in scikit-learn examples and countless tutorials? It is because 42 is considered the most important number (at least by the fans of the sci-fi genre).
Don't believe it? Google "the answer to the ultimate question of life, the universe, and everything."