Creating a Statistical Model to Predict Titanic Survivability
Last week, the Titan submarine made headlines as it unsuccessfully attempted to observe the wreckage of the Titanic. All 5 people aboard the small Titan vessel were presumed dead as debris from the implosion was found a few days after the vessel had lost communication with land. While the Titan implosion did not spare a single passenger, a fair share of passengers made it out of the Titanic disaster in 1912. It may come as no surprise that there are trends in survivorship. After all, countries and cultures around the world make an effort to prioritize women's and children's safety in times of emergency. Jack wasn’t the only one to save himself for a Rose, right?
So, using actual passenger data, let’s analyze trends in the data and see if we can identify patterns in the survivorship of the Titanic.
Data was obtained from a Kaggle Machine Learning competition.
Using passenger class as a proxy for socio-economic status, we see, unsurprisingly, that the proportion of survivorship increases among wealthier folks.
Whereas 75% of 3rd Class passengers died, roughly 62% of 1st-class passengers made it out alive. Approximately half of 2nd Class passengers survived.
Interestingly, the center of the age distribution does not seem to differ much between passengers who survived and passengers who died. Both distributions are centered in the late 20s (both subsets have a median of 28, to be precise), and both have a slight right skew. It’s interesting to note that there is a greater frequency of younger passengers (between the ages of 0 and 10) among those who survived. However, as a caveat, a greater frequency of survival does not necessarily imply a priority of survival, because it would be reasonable to assume that people of different ages have different rates of survivability.
While there is not much difference in age between male passengers who died and male passengers who survived, the median age of female passengers who survived is markedly greater than that of female passengers who died. To me, this result was somewhat counterintuitive because one would expect daughters and younger mothers to have priority in survival. This visual challenges that notion. It could be possible that slightly older ladies are more resilient or have better survival techniques.
After finding the general picture of the data, I then created a binary logistic regression model to determine what factors would play into an increase in odds of survival and which ones wouldn’t. This model is necessarily binary since there are only two outcomes—survival or death. Using this model, I could then use input data from other passengers (not in the base dataset) and predict whether they would survive or not.
My first model would include the variables age, sex, and class. The resulting model yields:
p is defined as the probability of survival. The p-values for the coefficients and the intercept were all less than 0.05, meaning that across the board, all of the variables have a statistically significant association with the log(odds) of survival.
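For reference, a logistic regression with age, sex, and class as predictors has the general form sketched below. The coefficient symbols and the indicator-variable names (SexMale, Class2, Class3) are notation assumed here for illustration, with 1st class and female as the assumed baseline categories; the fitted values are not reproduced.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0
  + \beta_{\text{age}}\,\text{Age}
  + \beta_{\text{male}}\,\text{SexMale}
  + \beta_{2}\,\text{Class2}
  + \beta_{3}\,\text{Class3}
```

Each coefficient describes the change in the log(odds) of survival for a one-unit change in its variable, holding the others constant.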
I wanted to see how much class would impact the probability of survival. Our model would predict a 28-year-old male to have a 0.555 probability of survival in 1st class, a 0.252 probability of survival in 2nd class, and a 0.086 probability of survival in 3rd class.
Note that the relationship between odds and probability is p = odds / (1 + odds).
Then I wanted to see how the probability of survival would differ between females and males. As noted earlier, a 28-year-old male in 1st class would have a 0.555 probability of living, but we found that a 28-year-old female in 1st class would have a 0.939 probability of living.
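As a small illustration of that conversion, here is a minimal Python sketch. The function names are my own, and the example value is the 0.555 probability reported for a 28-year-old male in 1st class.

```python
def odds_to_prob(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)


def prob_to_odds(p):
    """Convert a probability to odds: odds = p / (1 - p)."""
    return p / (1 - p)


# A 0.555 probability of survival corresponds to odds of roughly 1.25,
# i.e. about 5 survivors for every 4 deaths among comparable passengers.
odds_first_class_male = prob_to_odds(0.555)
print(round(odds_first_class_male, 3))
```

Going back and forth between the two scales is handy because the model is linear in log(odds), not in probability.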
In the second class, a 28-year-old male has a 0.252 probability of surviving whereas a 28-year-old female has a 0.808 probability of surviving, according to our model. Finally, in the third class, a 28-year-old female has a 0.541 probability of surviving. Shockingly, our model suggests that a third class female passenger has almost the same probability of surviving as a 1st class male passenger.
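Because the first model has no interaction terms, the gap between females and males should be about the same in every class once we move to the log(odds) scale. The sketch below checks this using only the predicted probabilities reported above; the helper name logit and the dictionary layout are my own choices.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


# Predicted survival probabilities for 28-year-olds, as reported above.
reported = {
    "1st": {"male": 0.555, "female": 0.939},
    "2nd": {"male": 0.252, "female": 0.808},
    "3rd": {"male": 0.086, "female": 0.541},
}

log_odds_gaps = {}
for pclass, probs in reported.items():
    gap = logit(probs["female"]) - logit(probs["male"])
    log_odds_gaps[pclass] = gap
    # exp(gap) is the implied female-to-male odds ratio in this class.
    print(f"{pclass} class: log-odds gap = {gap:.2f}, "
          f"odds ratio = {math.exp(gap):.1f}")
```

The gap comes out at roughly 2.5 in all three classes, which corresponds to the odds of survival for a female passenger being about 12 times those of a comparable male passenger. The probabilities differ so much by class only because the same shift in log(odds) lands on different parts of the probability curve.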
Finally, what effect does age have on the probability of survival? The predicted probabilities of survival for 2nd class male passengers of different ages were compared.
10-year-old: 0.396
28-year-old: 0.252
46-year-old: 0.148
Perhaps unsurprisingly, an increase in age is related to a decrease in the probability of survival.
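We can also back out roughly how much each additional year of age moves the log(odds) of survival, using only the three predictions listed above. This is a rough check rather than the fitted coefficient itself, and the variable names are my own.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


# Predicted survival probabilities for 2nd class males, keyed by age.
reported = {10: 0.396, 28: 0.252, 46: 0.148}

ages = sorted(reported)
slopes = []
for younger, older in zip(ages, ages[1:]):
    # Change in log-odds per year of age between two reported ages.
    slope = (logit(reported[older]) - logit(reported[younger])) / (older - younger)
    slopes.append(slope)
    print(f"ages {younger} to {older}: {slope:.4f} log-odds per year")
```

Both age gaps give nearly the same slope of about -0.037 per year, which is what we would expect from a model that is linear in age on the log(odds) scale.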
There’s also the size of a passenger’s traveling party, which we should take into account in our model. For example, would traveling by oneself be safer than traveling with a large party? I assumed that traveling alone, in this situation, would be safer than traveling with others, since one only needs to look out for oneself.
In our dataset, the variable sibsp measures the number of siblings and/or spouses aboard the Titanic for a given passenger, and the variable parch measures the number of parents and/or children aboard the Titanic for a given passenger. When accounting for these variables in our logistic regression model, the sibsp variable yields a coefficient of -0.368, and parch yields a coefficient of -0.0386. However, while the sibsp variable’s p-value was relatively small, the parch coefficient yields a p-value of 0.74685. This means that if there were truly no association between the log(odds) of survival and the number of parents or children that a passenger travels with (with other variables kept constant), there would still be quite a large probability of observing a coefficient at least as far from zero as -0.0386 in a random sample of similar size. To be thorough, we created models with both sibsp and parch, with just parch, and with just sibsp. Consistently, parch had very large p-values, whereas all of sibsp’s p-values were less than our alpha value of 0.05. In other words, there is no statistically significant evidence that the number of parents or children has an effect on the odds of survival, which is very surprising.
Using our model, we estimate that a 28-year-old man in 2nd class with no siblings or spouses (onboard) has a 0.277 probability of survival. That same man with 1 sibling and/or spouse has a 0.207 probability of survival. And the same man with 2 siblings and/or spouses has a 0.152 probability of survival. In general, we can say that an increase in siblings and/or spouses is associated with a decrease in the chance of survival.
However, if we want to build the most accurate model possible, we must also take into account interaction effects. An interaction effect occurs when the effect of one variable depends on (or is altered by) the value of another variable. For example, doctors often prescribe certain medicines under the specific condition that you are not taking painkillers or other types of drugs. That’s because the prescribed medicine affects your body differently when you are taking other drugs.
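To see how a coefficient turns into these probabilities, the sketch below starts from the 0.277 baseline and shifts the log(odds) by the reported sibsp coefficient of -0.368 for each additional sibling or spouse. The helper names are my own. I am also assuming the coefficient and the predictions can be combined this way, so the results land close to, but not exactly on, the figures above (the predictions may have come from a slightly different model variant).

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Probability corresponding to a log-odds value."""
    return 1 / (1 + math.exp(-x))


SIBSP_COEF = -0.368  # reported coefficient for sibsp
BASELINE_P = 0.277   # reported probability with no siblings/spouses

predicted = {}
for n_sibsp in range(3):
    # Each sibling/spouse shifts the log-odds by the sibsp coefficient.
    predicted[n_sibsp] = inv_logit(logit(BASELINE_P) + SIBSP_COEF * n_sibsp)
    print(f"{n_sibsp} siblings/spouses: {predicted[n_sibsp]:.3f}")
```

This gives roughly 0.210 and 0.155 for one and two siblings or spouses, within a few thousandths of the 0.207 and 0.152 quoted above.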
We created a logistic regression model including the interaction effect between sex and age. This would show us whether the effect of age on one’s survivability differs by sex. Our interaction term (SexMale*Age) returned a coefficient of -0.0507 (p-value = 0.0008). Different combinations of interaction terms were also attempted, such as the interaction effect between socioeconomic status and sex. Our 3rd class and male interaction effect yielded a coefficient of 1.704 (p-value = 0.0245), while other interaction terms were deemed not statistically significant.
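To make the sex and age interaction concrete, here is the general shape of such a model. The coefficient symbols are notation assumed for illustration, and the trailing dots stand for whatever other terms the model includes; only the -0.0507 interaction coefficient comes from the results above.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0
  + \beta_{\text{age}}\,\text{Age}
  + \beta_{\text{male}}\,\text{SexMale}
  + \beta_{\text{int}}\,(\text{SexMale}\times\text{Age})
  + \dots

\text{Female } (\text{SexMale}=0):\quad
  \frac{\partial}{\partial\,\text{Age}} \log\!\left(\frac{p}{1-p}\right) = \beta_{\text{age}}

\text{Male } (\text{SexMale}=1):\quad
  \frac{\partial}{\partial\,\text{Age}} \log\!\left(\frac{p}{1-p}\right)
  = \beta_{\text{age}} + \beta_{\text{int}}
  = \beta_{\text{age}} - 0.0507
```

In words, each additional year of age lowers the log(odds) of survival by about 0.05 more for males than for females.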
As a final point, in order to create the most accurate multiple logistic regression model possible, we would have to minimize the value of Akaike’s Information Criterion (AIC). This value estimates the relative amount of prediction error in a model, and thus a lower AIC indicates a better balance between goodness of fit and parsimony. Whereas the R-squared value of a multiple linear regression model never decreases when more terms are added, AIC penalizes a model for adding unnecessary terms. When messing around with different variables and interaction terms, we came up with the following model:
This model yielded an AIC of 614.69, whereas the first model we created, with only the terms Class, Sex, and Age, yielded an AIC of 657.28. As of right now, this is the most accurate model that I could come up with.
With statistics, it’s always important to be thorough and clear in the results. While we may have found certain associations between different variables and the probability of survival, it is paramount to understand that this is merely a model drawn from observational data. We may notice that when controlling for other variables, being a woman is associated with greater survivability, but we must not jump to conclusions of causality. There is a plethora of untested and unseen possible reasons behind this association: perhaps the passengers prioritized women’s safety, women were more resilient to the conditions, or women were more prepared for a disaster. Nonetheless, the data shows a remarkable story of what conditions and characteristics were associated (and not associated) with survival on the Titanic.
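For readers unfamiliar with AIC, the standard definition is AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below writes that out and compares the two AIC values reported above; the function and variable names are my own, and the log-likelihoods of the actual models are not given here, so only the difference is computed.

```python
def aic(num_params, log_likelihood):
    """Akaike's Information Criterion: 2k - 2 * ln(L)."""
    return 2 * num_params - 2 * log_likelihood


# AIC values reported for the two models.
AIC_FIRST_MODEL = 657.28  # Class, Sex, and Age only
AIC_FINAL_MODEL = 614.69  # final model with additional terms

# A positive difference means the final model has the lower (better) AIC.
delta = AIC_FIRST_MODEL - AIC_FINAL_MODEL
print(f"AIC improvement: {delta:.2f}")
```

The 2k term is the penalty: an extra term only lowers AIC if it raises the log-likelihood by more than 1.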
Sources:
https://www.kaggle.com/competitions/titanic/overview
https://en.wikipedia.org/wiki/Titanic