The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
From the problem statement we know this is a binary classification task; my first thoughts were SVM and the perceptron.
Given the data the problem provides, a Decision Tree or Random Forest is probably the more reasonable choice. Here, though, I want to try Logistic Regression (sigmoid + cross entropy).
All of the training data is converted into 0/1 values, and the rest is left to the NN. Since the activation function of the last layer is a sigmoid, its output can saturate at exactly 0 or 1; to keep the cross entropy from hitting log(0), the values are clamped to a minimum of 0.00001 and a maximum of 0.99999, so training never blows up with NaN.
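A minimal sketch of that clamping trick, assuming a NumPy-style loss function (the original network code is not shown here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def clipped_cross_entropy(y_true, y_pred, lo=0.00001, hi=0.99999):
    # Clamp the sigmoid outputs away from exact 0 and 1 so that
    # log() never sees 0, which would turn the loss into NaN/inf.
    y_pred = np.clip(y_pred, lo, hi)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))

# Toy usage with made-up logits and labels.
y = np.array([1.0, 0.0, 1.0])
p = sigmoid(np.array([4.2, -3.1, 0.5]))
print(clipped_cross_entropy(y, p))
```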
Results
Kaggle: 0.76555
With only this score, there are a few points worth reviewing:
- Overfitting?: training accuracy reaches about 90%, but the test score tops out at this number. Besides overfitting, the data itself is a concern: some records were deliberately discarded for training, and what was kept may in turn be missing from the test data.
- How to address the overfitting: dropout may not actually do better than plain regularization here; this needs tuning (see the first sketch after this list).
- How the missing data was filled: many blanks were simply filled with zero or the mean, which may ignore hidden correlations (see the second sketch after this list).
- Features: this is the most likely culprit. With a similar setup, XGB did not do much better either, so other feature representations should be tried (see the third sketch after this list).
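For the dropout-vs-regularization point, here is one way to compare the two options; a sketch only, written in PyTorch since the original post does not name a framework, with a hypothetical input width of 10 features:

```python
import torch
import torch.nn as nn

# Variant A: dropout between the layers.
model_dropout = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Dropout(p=0.5),          # randomly zeroes activations during training
    nn.Linear(32, 1), nn.Sigmoid(),
)
opt_a = torch.optim.Adam(model_dropout.parameters())

# Variant B: no dropout; L2 regularization via weight_decay instead.
model_l2 = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)
opt_b = torch.optim.Adam(model_l2.parameters(), weight_decay=1e-4)
```

Training both variants on the same split and comparing validation accuracy would show which form of regularization actually helps here.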
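For the imputation point, a sketch of filling Age from a group statistic instead of a single global value; Pclass, Sex, and Age are the standard column names in the Kaggle Titanic train.csv:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Naive: one global mean for every missing Age.
df["Age_naive"] = df["Age"].fillna(df["Age"].mean())

# Group-aware: median Age within each (Pclass, Sex) group, which
# captures some of the hidden correlation mentioned above.
df["Age_grouped"] = df["Age"].fillna(
    df.groupby(["Pclass", "Sex"])["Age"].transform("median")
)
```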
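And for the feature point, one commonly tried direction is deriving new columns from the raw ones before one-hot encoding; again a sketch, not what the original run used:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Title extracted from Name ("Braund, Mr. Owen Harris" -> "Mr").
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)

# Family size from SibSp + Parch, plus the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# One-hot encode the categoricals into 0/1 columns, matching the
# all-0/1 input format used for the NN above.
features = pd.get_dummies(df[["Pclass", "Sex", "Title", "FamilySize"]],
                          columns=["Pclass", "Sex", "Title"])
```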
I originally thought about redoing this properly with Random Forest and XGB, but on reflection the problem really does seem to be the features; others who used deep learning have surely reached much higher scores.
Moving on to the next problem for now; hopefully I will come back with fresh ideas.