Data Science Day 12: Odds Ratio
Learning Objective:
Probability vs Odds Vs Odds Ratio
1. Probability = Event/Sample Space
2. Odds= Prob(Event)/Prob(Non-Event)
3. Odds Ratio = Odds(Group 1)/ Odds(Group 2)
Interpretation
The Odds Ratio is a measure of association between exposure and outcome.
OR=Odds(Group 1)/Odds(Group2)>1 indicates the increased occurrence of an event in Group 1 compared to Group 2.
OR=Odds(Group 1)/Odds(Group2) < 1 indicates the decreased occurrence of an event in Group 1 compared to Group 2.
The true Odds Ratio lies in between 95% Confidence interval and P-value represents the statistical significant
955169 / Pixabay
Example: UCLA Graduate School Admission dataset
calculate both theoretical and true Odds Ratio and interpret the meaning of odds ratio
<script src="https://gist.github.com/fangya18/2cdf0ae21856edbaca0c1d3d0aefd501.js"></script>
admit gre gpa prestige 0 0 380 3.61 3 1 1 660 3.67 3 2 1 800 4.00 1 3 1 640 3.19 4 4 0 520 2.93 4
#1 is the most prestiges school.
# we make a dummy_rank to group prestige 1,2 as 1 and 3,4 as 2
df["dummy_rank"]=np.where(df["prestige"] <3 , 1 ,2)
df.hist()
pl.show()
#dummy_rank=pd.get_dummies(df["prestige"],prefix="prestige")
print (df.head())
#frequncy table prestiges vs admit
print(pd.crosstab(df['admit'],df["dummy_rank"]))
admit gre gpa prestige dummy_rank 0 0 380 3.61 3 2 1 1 660 3.67 3 2 2 1 800 4.00 1 1 3 1 640 3.19 4 2 4 0 520 2.93 4 2 dummy_rank 1 2 admit 0 125 148 1 87 40
#Apply logistic regression
X=df[["gre","gpa","dummy_rank"]]
logit=sm.Logit(df["admit"],X)
result=logit.fit()
print (result.summary())
print (result.conf_int())
Optimization terminated successfully. Current function value: 0.593637 Iterations 5 Logit Regression Results ============================================================================== Dep. Variable: admit No. Observations: 400 Model: Logit Df Residuals: 397 Method: MLE Df Model: 2 Date: Fri, 19 Oct 2018 Pseudo R-squ.: 0.05014 Time: 17:44:14 Log-Likelihood: -237.45 converged: True LL-Null: -249.99 LLR p-value: 3.604e-06 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ gre 0.0014 0.001 1.318 0.188 -0.001 0.003 gpa 0.0247 0.204 0.121 0.904 -0.375 0.425 dummy_rank -1.1395 0.222 -5.144 0.000 -1.574 -0.705 ============================================================================== 0 1 gre -0.000660 0.003368 gpa -0.375392 0.424737 dummy_rank -1.573685 -0.705355
# Theoratical odds ratio
print(np.exp(result.params))
params= result.params
conf=result.conf_int()
conf["OR"]=params
conf.columns=["2.5%","97.5%","OR"]
print(np.exp(conf))
gre 1.001355 gpa 1.024980 dummy_rank 0.319973 dtype: float64 2.5% 97.5% OR gre 0.99934 1.003374 1.001355 gpa 0.68702 1.529189 1.024980 dummy_rank 0.20728 0.493933 0.319973
# Calculate Probality vs Odds vs Odds ratio
prob_rank1_accept=87/(125+87)
print(prob_rank1_accept)
prob_rank2_accept=40/(148+40)
print(prob_rank2_accept)
odds_rank1=87/125
odds_rank2=40/148
print(odds_rank1, odds_rank2)
odds_ratio=odds_rank2/odds_rank1
print(odds_ratio)
0.41037735849056606 0.2127659574468085 0.696 0.2702702702702703 0.38831935383659527
#Visulatization
%matplotlib inline
pd.crosstab(df.admit, df.dummy_rank).plot(kind="bar")
plt.title("Admit vs Prestige")
plt.xlabel("Admit")
plt.ylabel("Student Frequency Count")
Summary
Our theoretical Odds Ratio is 0.319 with a CI(0.20, 0.41), which is close to the true Odds ratio, 0.388. This indicates if the undergraduate students are from the school in prestige 3 or 4, the chances of them getting in graduate school is 38% that of the students from prestige 1 or 2 undergraduate schools. In other words, it is 2.5 times more likely for a student to get into a graduate school from undergraduate school rated in Prestige 1 or 2 compared to 3 or 4. Our graph supported the result!
Inspired by http://blog.yhat.com/posts/logistic-regression-and-python.html
Happy Studying!