Hello everyone. Last time, I wrote about “Root cause analysis” as the “Problem Solving Practice Edition”, using “Another dimension of countermeasures against the declining birthrate” as the material. This time, I would like to write about “Decision tree analysis,” a classic data mining method that is also closely tied to machine learning, the current hot topic.
1. What is “Decision Tree Analysis”?
“Decision tree analysis” is a method for extracting, from a data set with multiple columns, the independent variables that have the greatest impact on an outcome. As the name suggests, it displays the analysis results in a tree structure that is easy to understand, which is probably the main reason why this method has been popular for so long. If you google “Decision Tree Analysis” itself, you will find many articles, so I won’t go into detail here.
2. Difference between “Regression Analysis” and “Decision Tree Analysis”
As I wrote in the previous post, when you have data with many columns, such as customer survey results, you can use multiple regression analysis (linear or logistic). Decision tree analysis is also useful in those cases: among the many columns, you can discover the independent variables that have a large impact on the outcome.
So what is the difference between multiple regression analysis and decision tree analysis? How should we use them? Let me summarize the main points.
There may be other things, but I think these are the main ones. In short, decision tree analysis has fewer constraints, so I think it will take less time to prepare if you try decision tree analysis first.
On the other hand, I think the advantage of multiple regression analysis is that it lets you create a predictive formula. So, for data with a large number of columns, one option is to combine the two: first use decision tree analysis to narrow down the independent variables that are likely to have an impact, and then apply multiple regression analysis to those variables.
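The combination described above could look roughly like the following in R with the rpart package. This is only a sketch under assumptions: the data frame `survey_toy` is synthetic and built inside the code, the column names `Freq`, `Foods`, `Spices` and `Price` are stand-ins, and picking the top two entries of rpart's `variable.importance` is just one possible way to narrow down, not a fixed rule.

```r
library(rpart)

# Synthetic stand-in for a survey file (NOT real data): 200 respondents.
i <- seq_len(200)
survey_toy <- data.frame(
  Foods  = (i - 1) %% 4,  # hypothetical count of foods switched
  Spices = (i - 1) %% 5,  # hypothetical count of seasonings switched
  Price  = (i - 1) %% 3   # hypothetical extra column
)
# Outcome: 1 = purchase frequency increased, 0 = otherwise.
survey_toy$Freq <- as.integer(survey_toy$Foods >= 2 | survey_toy$Spices >= 3)
# Flip every 7th answer so the outcome is not perfectly predictable.
flip <- i %% 7 == 0
survey_toy$Freq[flip] <- 1L - survey_toy$Freq[flip]

# Step 1: quick decision tree to see which columns matter most.
tree_fit <- rpart(Freq ~ ., data = survey_toy)
top_vars <- head(names(tree_fit$variable.importance), 2)

# Step 2: logistic regression on the narrowed-down columns only.
logit_fit <- glm(reformulate(top_vars, response = "Freq"),
                 data = survey_toy, family = binomial)

print(top_vars)
print(coef(logit_fit))
```

With a real file, how many variables to carry over to the regression is a judgment call; two is used here only to keep the example small.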
3. Procedure of “Decision Tree Analysis”
Decision tree analysis cannot be performed in Excel, so this time we will use the free statistical tool “R”, introduced in the previous post on logistic regression analysis. R is a very useful tool that can handle various data mining (statistical analysis) methods, such as multiple regression analysis and cluster analysis. If you google R itself, you will find many articles, so I won’t go into detail here.
There are various ways to use R, but as in the previous post, we will use “Google Colaboratory” (hereafter “Colabo”; this is also free!) to run R. First, set up Colabo to run R, and then load the file to be analyzed. These steps are common to all analysis methods, so please refer to the previous post.
As for the file to be analyzed, I will use the same file used in the previous post on logistic regression analysis. I’m looking forward to seeing if the results differ depending on the analysis method. By the way, the contents (columns) of the file look like this.
Gender
Age
Age group
Marital status
Prefecture of residence
Private brands you know (multiple choice)
Image by private brand (multiple choice)
Change in purchase frequency of private brands in the last 1-2 years (increase/decrease/no change)
Reasons for the above (free answer)
Food products that have switched to private brands within the past year (multiple choice)
Seasonings that have switched to private brands within the past year (multiple choice)
Beverages that have switched to private brands within the past year (multiple choice)
Drugs that have switched to private brand within the past year (multiple choice), etc.
Looking at the raw data columns above, the most obvious candidate for the outcome variable is “Change in purchase frequency of private brands in the last 1-2 years (increase/decrease/no change)”. It also seemed that “Foods/seasonings/beverages/drugs that have switched to private brands within the past year” can be used.
Now, let’s load the above file and run the decision tree analysis. First, in R, load the library “rpart” for decision tree analysis and the library “partykit” for displaying the analysis results.
Next is the command for decision tree analysis.
“rpart”, on the right-hand side, is the decision tree analysis command, and its result is stored in the variable “decision_tree_test”. As for the rpart arguments, “Freq” in the parentheses is the outcome variable, and the “~.” that follows means that all remaining columns are used as independent variables.
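Since the command itself is not reproduced in the text, here is a minimal sketch of what such a call could look like. The variable name “decision_tree_test”, the outcome column “Freq” and the “~ .” formula follow the description above; the data frame `survey_toy` and its other columns are synthetic stand-ins, not the actual survey file.

```r
library(rpart)

# Synthetic stand-in for the survey file (NOT the real data).
i <- seq_len(200)
survey_toy <- data.frame(
  Foods  = (i - 1) %% 4,  # hypothetical count of foods switched
  Spices = (i - 1) %% 5,  # hypothetical count of seasonings switched
  Price  = (i - 1) %% 3   # hypothetical extra column
)
# Outcome: 1 = purchase frequency increased, 0 = unchanged/decreased.
survey_toy$Freq <- as.integer(survey_toy$Foods >= 2 | survey_toy$Spices >= 3)
flip <- i %% 7 == 0                       # add some noise
survey_toy$Freq[flip] <- 1L - survey_toy$Freq[flip]

# "Freq" is the outcome; "~ ." uses every other column as an
# independent variable.
decision_tree_test <- rpart(Freq ~ ., data = survey_toy)
print(decision_tree_test)
```

Because `Freq` is numeric (0/1) in this sketch, rpart fits a regression tree (method "anova"), which is the case where the printed header reads “node), split, n, deviance, yval”.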
Looking at the results, “n” is the number of data items, which is 697 this time. The header line reads “node), split, n, deviance, yval”, and the values on each following line appear in this order. “split” is the condition on a column that divides the data, “n” is the number of data items that fall into that category, and “yval” is the share of people in that category whose purchase frequency of private brands increased (outcome variable = 1). You can ignore “deviance” for now.
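To make the meaning of “yval” concrete, the sketch below checks that, for a 0/1 outcome, the yval of each terminal node equals the share of rows with outcome = 1 in that node. It assumes a regression tree fitted by rpart on a numeric 0/1 column; the data frame `survey_toy` is synthetic and its columns are hypothetical stand-ins for the survey file.

```r
library(rpart)

# Synthetic stand-in for the survey file (NOT the real data).
i <- seq_len(200)
survey_toy <- data.frame(
  Foods  = (i - 1) %% 4,
  Spices = (i - 1) %% 5,
  Price  = (i - 1) %% 3
)
survey_toy$Freq <- as.integer(survey_toy$Foods >= 2 | survey_toy$Spices >= 3)
flip <- i %% 7 == 0
survey_toy$Freq[flip] <- 1L - survey_toy$Freq[flip]

decision_tree_test <- rpart(Freq ~ ., data = survey_toy)

# $where gives, for each respondent, the row of $frame that describes
# the terminal node the respondent falls into.
leaf_share <- tapply(survey_toy$Freq, decision_tree_test$where, mean)
leaf_rows  <- as.integer(names(leaf_share))

leaf_summary <- data.frame(
  n          = decision_tree_test$frame$n[leaf_rows],
  yval       = decision_tree_test$frame$yval[leaf_rows],
  share_of_1 = as.numeric(leaf_share)
)
print(leaf_summary)  # yval and share_of_1 should agree row by row
```

Multiplying “n” by “yval” then gives the head count in a category, which is how a figure such as 45% of 20 people = 9 people can be read off the output.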
Reading the results in this way, we found that “Number of foods switched to private brands within the past year” had the greatest impact on the outcome variable “Change in purchase frequency of private brands in the last 1-2 years (1 = increased, 0 = unchanged/decreased)”. If the number of food items is less than 1.5, the next most impactful item is “Number of seasonings switched to private brands within the past year”, and if the number of seasonings is 1.5 or more, the share of people whose purchase frequency increased is 45% of the 20 people in this category (= 9 people).
The numbers alone are a little hard to read, so let’s try “partykit”, a library for displaying analysis results.
The positional relationships are easier to see here; it really is the shape of a decision “tree”. However, even with partykit the text can get distorted, and adjusting it is a bit of a hassle, so when I actually use the results in presentations, I redraw the tree diagram myself in PowerPoint. This is a matter of personal preference.
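For reference, drawing an rpart result with partykit could be sketched as follows. The exact command used for the figure is not reproduced in the text, so this is an assumption: it converts the fitted tree with `as.party()` and plots it, again on a synthetic data frame `survey_toy` with hypothetical columns.

```r
library(rpart)
library(partykit)

# Synthetic stand-in for the survey file (NOT the real data).
i <- seq_len(200)
survey_toy <- data.frame(
  Foods  = (i - 1) %% 4,
  Spices = (i - 1) %% 5,
  Price  = (i - 1) %% 3
)
survey_toy$Freq <- as.integer(survey_toy$Foods >= 2 | survey_toy$Spices >= 3)
flip <- i %% 7 == 0
survey_toy$Freq[flip] <- 1L - survey_toy$Freq[flip]

decision_tree_test <- rpart(Freq ~ ., data = survey_toy)

# Convert the rpart object to partykit's tree class and draw it.
tree_party <- as.party(decision_tree_test)
plot(tree_party)
```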
Now, how does this result compare with the results of the previous logistic regression analysis? In the logistic regression analysis, the variables with the largest impact were, in order, “Price (what is important when purchasing private brand products – [price])” and “Foods (number of foods switched from manufacturer products to private brand products within the past year)”, followed by “Spices (number of seasonings switched to private brands within the past year)” in third place.
In this decision tree analysis, “Price” has disappeared, and “Foods” and “Spices” have a greater impact. Since the analysis logic is different, even if you use the same data to examine the impact of independent variables on the outcome in the same way, the results may naturally differ.
Therefore, as I wrote above, one approach is to first run a quick decision tree analysis to narrow down the independent variables to some extent, and then run multiple regression analysis. However, since unexpected differences like this one can turn up, I would recommend taking the time to try both analyses.
This time, I wrote about the steps of decision tree analysis, how it differs from multiple regression analysis, and how to combine them. I hope this is of some help to you.
That’s all for this time; I will continue in the next post. Thank you for reading to the end.