Since the 2019.3 release, Tableau Prep has been able to run scripts that leverage R and Python. This opens up a whole new world of possibilities, as it greatly expands Tableau Prep's capabilities (see Make your own Table Calculations with Python in Tableau Prep Builder).
Today, we'll explore how R can be used within Tableau Prep Builder to solve one of the most popular machine learning competitions on Kaggle: the Titanic case. The RMS Titanic was the largest ship ever built at the time and was considered unsinkable. But on the night of April 14th, 1912, it struck an iceberg, leading to one of the deadliest maritime disasters in history.
A small summary of the movie in case you’ve never seen it.
The dataset comes in two parts: a training set and a test set. I'll use the training set to build and select the best model, and then make a proper submission on Kaggle.

The first step is to have a look at the data, and that's where the Tableau Prep interface is very useful. At a glance, you can get a good idea of the distribution of variables like Age.
You can see that Tableau Prep makes it easy to count the null values within Age or see the skewness of the Fare distribution.
From there, I already notice that the dataset contains some NULL values in the Age variable. To solve this, we will simply replace those NULLs with the mean age of each Sex-Pclass group (one may argue that this 'naive' method is not very accurate, but I will leave it at that for now). I would also like to extract the Title from the Name field and create a new field called family size that contains the number of relatives travelling with each passenger.

Let's also have a look at other key variables: Parch refers to the number of parents/children, and SibSp refers to the number of siblings/spouses. I will use those two variables later to calculate the number of people travelling with each passenger. There is also Pclass, mentioned earlier, which refers to the class of the passenger.

We are then left with the ticket number, the cabin number, the fare of the ticket, the name of the passenger and the port where they embarked. I will not use those variables here, but they could be interesting to bring in when you start refining your model.
How to make Tableau Prep understand R
I need to create an R script that contains the basic structure Tableau Prep is able to read. Tableau Prep needs two functions: the first one contains the actual R code you want to apply to your data; the second one tells Tableau Prep how the data will be returned.

That second function, getOutputSchema, is used to define the output schema. If you want to return only certain columns or extract specific information from your R output, this is where you define the name and the type of each field.
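To make that structure concrete, here is a minimal sketch of such a script. The function name clean_data is my own (you point the Script step at it in Prep), and the prep_int() type helper follows the schema conventions in Tableau's documentation.

```r
# Minimal sketch of a Tableau Prep R script (assumes the Rserve connection
# is already configured in Prep). The function name is arbitrary; you
# reference it in the Script step settings.
clean_data <- function(df) {
  # df arrives from Prep as a data.frame; the return value must be one too
  df$family_size <- df$SibSp + df$Parch + 1
  # Return only the fields declared in getOutputSchema below
  return(df[, c("PassengerId", "family_size")])
}

# Tell Prep the name and type of each field coming back
getOutputSchema <- function() {
  return(data.frame(
    PassengerId = prep_int(),
    family_size = prep_int()
  ))
}
```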
In the data cleaning step, I append the new fields directly to the current dataset.

In getOutputSchema, I define the fields I want to keep, along with the new ones I just created.
As mentioned before, I would like to impute the mean age of each Sex-Pclass group. I extract the mean age for each combination and save it in a data frame that I then join back to the existing file.
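A sketch of that imputation in base R, assuming the incoming data frame is called df and carries the Age, Sex and Pclass columns:

```r
# Mean age per Sex-Pclass combination (the formula interface drops
# rows with a missing Age automatically)
mean_ages <- aggregate(Age ~ Sex + Pclass, data = df, FUN = mean)
names(mean_ages)[3] <- "mean_age"

# Join the group means back and fill in the missing ages
df <- merge(df, mean_ages, by = c("Sex", "Pclass"), all.x = TRUE)
df$Age <- ifelse(is.na(df$Age), df$mean_age, df$Age)
df$mean_age <- NULL
```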
I will do the same for the test dataset, as it also contains missing Age values. As part of the data preparation for fine-tuning my model, I create a Title and a family_size variable. The Title variable refers to the title of the passenger. Since the title always sits in the same place in the Name variable, it is convenient to use a regular expression to extract it. The family size is a simple sum of the number of parents/children, the number of siblings/spouses, and 1 to count the passenger themselves.
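For illustration, one way to compute both in R; the regular expression relies on the title sitting between the comma and the following period (e.g. "Braund, Mr. Owen Harris"):

```r
# Extract the title: the text between ", " and the next "."
df$Title <- sub(".*, ([A-Za-z ]+)\\..*", "\\1", df$Name)

# Family size = siblings/spouses + parents/children + the passenger
df$family_size <- df$SibSp + df$Parch + 1
```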
Survival rates on the Titanic weren’t really distributed equally as you’ll find out pretty soon.
To be able to use both datasets for the model building and the prediction, I need to union the two sets. I also rename the table names to test and train; this will help me distinguish the two datasets in the next steps. I add a clean step to rename the field, as you can't do it directly in the Union step. The default name 'Table Names' is not really compatible with the R way of coding; it is recommended to use snake case (table_name) or camel case (tableName).
I also fine-tuned the Title variable by grouping the less frequent titles into a bigger category: Rare. This is important because those titles only appear in either the train set or the test set (e.g. Dona only appears in the test set). This would affect the model, as it won't be able to predict Survived based on values it has never seen before.
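A sketch of that regrouping; the list of rare titles below is illustrative rather than the exact list used in the flow:

```r
# Collapse the infrequent titles into a single "Rare" bucket
rare_titles <- c("Dona", "Lady", "the Countess", "Capt", "Col", "Don",
                 "Dr", "Major", "Rev", "Sir", "Jonkheer")
df$Title[df$Title %in% rare_titles] <- "Rare"
```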
The Union step lets you stack the outputs of your different steps into a single table.
Let’s build some models
Now, I will enter the core of the flow. I need to write a script to build a model. I will use two popular algorithms: logistic regression and random forest. To assess each model, I wrote two scripts that return four key metrics: the accuracy, the precision, the recall and the AUC. The accuracy is the ratio of correct predictions over the total number of predictions; this gives us an overall performance KPI for the model. The precision determines how reliable the positive predictions are (true positives vs. false positives). The recall indicates how well the model performs when classifying true positives vs. false negatives. Finally, the AUC, or Area Under the Curve, measures how well the model separates true positives from true negatives across classification thresholds (you can find more info about this on the following page: Machine Learning Cheat Sheet - CS229 Stanford University).
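As an illustration, the four metrics can be computed on a validation set along these lines, assuming actual and predicted are 0/1 vectors and probs holds the predicted probabilities (the pROC package is one option for the AUC):

```r
library(pROC)

tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives

accuracy  <- mean(predicted == actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
auc_value <- as.numeric(auc(roc(actual, probs)))
```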
To return the results of the model, the whole workflow needs to be wrapped in the main function that Tableau Prep calls.
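Here is a condensed sketch of what such a wrapped script can look like. The function name evaluate_model is my own, the 70/30 split ratio is an assumption, and so is the exact filter on the table_name field:

```r
library(pROC)

evaluate_model <- function(df) {
  # Keep only the labelled rows from the unioned dataset
  df <- df[df$table_name == "train", ]

  # Split the training data into a train set and a validation ("test") set
  set.seed(42)
  idx   <- sample(nrow(df), size = floor(0.7 * nrow(df)))
  train <- df[idx, ]
  valid <- df[-idx, ]

  # Fit the model and score the validation set
  model <- glm(Survived ~ Pclass + family_size + Title + Sex + Age,
               data = train, family = binomial)
  probs <- predict(model, newdata = valid, type = "response")
  pred  <- ifelse(probs > 0.5, 1, 0)

  tp <- sum(pred == 1 & valid$Survived == 1)
  fp <- sum(pred == 1 & valid$Survived == 0)
  fn <- sum(pred == 0 & valid$Survived == 1)

  # Return a single row holding the four metrics
  return(data.frame(
    accuracy  = mean(pred == valid$Survived),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn),
    auc       = as.numeric(auc(roc(valid$Survived, probs)))
  ))
}

getOutputSchema <- function() {
  return(data.frame(
    accuracy  = prep_decimal(),
    precision = prep_decimal(),
    recall    = prep_decimal(),
    auc       = prep_decimal()
  ))
}
```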
Reading the code above, you might have noticed that I split the training dataset in two: a train set and a test set. The test set serves as a validation step to make sure that the model is not over- or underfitting. Validating the model on its own training data would mean showing known data to the machine, which, most of the time, results in a deceptively high accuracy.
It’s all about modeling here
In the code for the logistic regression, I used the glm function from the stats package. This function uses a formula to determine the dependent and the independent variables. I defined the Survived variable as the dependent variable; that is what we are going to predict using the model and the test set. I defined a certain set of independent variables: Pclass (the class of the passenger), the family size, the title, the gender and the age.
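In isolation, the call looks roughly like this; the dependent variable sits to the left of the tilde and the independent variables to the right:

```r
# family = binomial makes glm fit a logistic regression (logit link)
model_glm <- glm(Survived ~ Pclass + family_size + Title + Sex + Age,
                 data = train, family = binomial)
```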
For the random forest model, I chose the ranger function from the ranger package. It follows the same formula syntax as glm. You can see that there are extra arguments: they determine the number of trees I want to have in my random forest (num.trees) and the number of independent variables considered when making a split at each node of the trees (mtry).
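A sketch of that call; the values for num.trees and mtry here are illustrative. Note that ranger treats the task as classification when the dependent variable is a factor:

```r
library(ranger)

train$Survived <- as.factor(train$Survived)  # classification needs a factor
model_rf <- ranger(
  Survived ~ Pclass + family_size + Title + Sex + Age,
  data      = train,
  num.trees = 500,  # number of trees in the forest
  mtry      = 3     # variables considered at each split
)

# ranger returns its predictions inside a list element
pred_rf <- predict(model_rf, data = valid)$predictions
```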
The results of both scripts make it clear that the logistic regression scores better. I will therefore use it to predict the Survived variable for the test set.
Metrics for the logistic regression
Metrics for the random forest
The final step is to get the predictions for the test set. For this, I reuse the code from the logistic regression script, but instead of scoring the validation set, I score the Kaggle test set, which doesn't contain the Survived variable.
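Roughly, that scoring step looks like this: the model is refitted on the full training data and then applied to the unlabelled test rows (again assuming the table_name field distinguishes the two sets):

```r
# Refit on all labelled data, then score the unlabelled Kaggle test set
train <- df[df$table_name == "train", ]
test  <- df[df$table_name == "test", ]

final_model <- glm(Survived ~ Pclass + family_size + Title + Sex + Age,
                   data = train, family = binomial)

test$Survived <- ifelse(
  predict(final_model, newdata = test, type = "response") > 0.5, 1, 0)
```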
You can see that this script repeats a lot of the code from the model scripts, as the results of the models are not kept in memory between script steps.

The final clean step serves to format the dataset to fit the submission requirements from Kaggle. I submitted my results and obtained an accuracy of 0.7790, which means I'm able to correctly predict ~78% of the test set.
Did the passenger survive or not?
Some final thoughts on Tableau Prep and R
Being able to add R scripts to Tableau Prep is a great addition, as it extends the data prep functionality to include advanced prepping techniques as well as built-in analytics. As Tableau Prep gains more functionality, like the ability to write to databases (coming in 2020), it is getting closer to other data prep tools such as Alteryx, or even SAS, which is still the standard for many companies.

Another advantage of combining both tools is the ability to visualize your data flow while doing advanced analytics. I could have accomplished the different steps by writing R code only in my preferred IDE (RStudio), but seeing the flow laid out visually reduces the time needed to understand its objective.

Nonetheless, there is only limited documentation provided by Tableau itself regarding this integration (see https://help.tableau.com/current/prep/en-us/prep_scripts._R.htm). If you are not familiar with the Tableau Prep and R combination, or with R in general, the learning curve might be steep and require some trial and error. Errors also tend to surface with a relatively uninformative message such as: "Something went wrong when running the script. Verify that there are no errors in the script, then try again". In my case, debugging was straightforward, as the script was not too lengthy, but it might become harder for longer scripts.
The whole project is available on GitHub: https://github.com/biztory/tableau-prep-r-ml-titanic/
Charles Laporte | Analytics Consultant @ Biztory