
Data Science Day 12: Odds Ratio

Learning Objective:

Probability vs Odds vs Odds Ratio

1. Probability = number of outcomes in the event / number of outcomes in the sample space
2. Odds = Prob(Event) / Prob(Non-Event)
3. Odds Ratio = Odds(Group 1) / Odds(Group 2)
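
The three definitions above can be sketched in a few lines of Python. The function names and the probabilities 0.75 and 0.5 are hypothetical and chosen only for illustration; they do not come from the admission dataset.

```python
def odds_from_prob(p):
    """Convert a probability p (0 <= p < 1) into odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def prob_from_odds(odds):
    """Convert non-negative odds back into a probability."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


def odds_ratio(p_group1, p_group2):
    """Odds(Group 1) / Odds(Group 2), given each group's event probability."""
    return odds_from_prob(p_group1) / odds_from_prob(p_group2)


# Hypothetical probabilities, for illustration only.
p1, p2 = 0.75, 0.5
print(odds_from_prob(p1))   # 3.0 (odds of 3 to 1)
print(odds_from_prob(p2))   # 1.0 (even odds)
print(odds_ratio(p1, p2))   # 3.0
```

Note that a probability of 0.75 corresponds to odds of 3, not 0.75, which is why the two quantities should not be read interchangeably.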

Interpretation

The Odds Ratio is a measure of association between exposure and outcome.

OR = Odds(Group 1) / Odds(Group 2) > 1 indicates that the event has higher odds of occurring in Group 1 than in Group 2.

OR = Odds(Group 1) / Odds(Group 2) < 1 indicates that the event has lower odds of occurring in Group 1 than in Group 2.

The 95% confidence interval gives a range of values that plausibly contains the true Odds Ratio, and the p-value indicates whether the association is statistically significant.


Example: UCLA Graduate School Admission dataset

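Calculate both the theoretical Odds Ratio (estimated by logistic regression) and the true Odds Ratio (computed directly from the frequency table), and interpret the meaning of the odds ratio.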
Code (gist): https://gist.github.com/fangya18/2cdf0ae21856edbaca0c1d3d0aefd501

   admit  gre   gpa  prestige
0      0  380  3.61         3
1      1  660  3.67         3
2      1  800  4.00         1
3      1  640  3.19         4
4      0  520  2.93         4

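import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# df is the admissions DataFrame previewed above.
# Prestige 1 is the most prestigious school.
# Build dummy_rank: prestige 1-2 -> group 1, prestige 3-4 -> group 2
df["dummy_rank"] = np.where(df["prestige"] < 3, 1, 2)
df.hist()  # one histogram per numeric column
plt.show()
# dummy_rank = pd.get_dummies(df["prestige"], prefix="prestige")
print(df.head())
# Frequency table: admit (rows) vs dummy_rank (columns)
print(pd.crosstab(df["admit"], df["dummy_rank"]))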

   admit  gre   gpa  prestige  dummy_rank
0      0  380  3.61         3           2
1      1  660  3.67         3           2
2      1  800  4.00         1           1
3      1  640  3.19         4           2
4      0  520  2.93         4           2
dummy_rank    1    2
admit               
0           125  148
1            87   40

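# Apply logistic regression of admit on gre, gpa and dummy_rank
# Note: sm.Logit does not add an intercept, so this model is fit without one
import statsmodels.api as sm

X = df[["gre", "gpa", "dummy_rank"]]
logit = sm.Logit(df["admit"], X)
result = logit.fit()
print(result.summary())
print(result.conf_int())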

Optimization terminated successfully.
         Current function value: 0.593637
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                          Logit   Df Residuals:                      397
Method:                           MLE   Df Model:                            2
Date:                Fri, 19 Oct 2018   Pseudo R-squ.:                 0.05014
Time:                        17:44:14   Log-Likelihood:                -237.45
converged:                       True   LL-Null:                       -249.99
                                        LLR p-value:                 3.604e-06
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
gre            0.0014      0.001      1.318      0.188      -0.001       0.003
gpa            0.0247      0.204      0.121      0.904      -0.375       0.425
dummy_rank    -1.1395      0.222     -5.144      0.000      -1.574      -0.705
==============================================================================
                   0         1
gre        -0.000660  0.003368
gpa        -0.375392  0.424737
dummy_rank -1.573685 -0.705355

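# Theoretical odds ratio: exponentiate the fitted coefficients
print(np.exp(result.params))
params = result.params
conf = result.conf_int()
conf["OR"] = params
conf.columns = ["2.5%", "97.5%", "OR"]
# Exponentiating moves the log-odds scale to the odds-ratio scale
print(np.exp(conf))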

gre           1.001355
gpa           1.024980
dummy_rank    0.319973
dtype: float64
               2.5%     97.5%        OR
gre         0.99934  1.003374  1.001355
gpa         0.68702  1.529189  1.024980
dummy_rank  0.20728  0.493933  0.319973
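
As a sanity check, the dummy_rank row of this table can be reproduced by hand. Each logistic regression coefficient is a log odds ratio for a one-unit increase in its variable (here, moving from dummy_rank 1 to dummy_rank 2), so exponentiating the coefficient and its interval bounds gives the odds ratio and its 95% CI. The sketch below uses only the numbers printed in the regression summary; the variable names are illustrative.

```python
import math

# Values copied from the regression output above (dummy_rank row).
coef = -1.1395                      # log odds ratio
ci_low, ci_high = -1.573685, -0.705355  # 95% CI on the log-odds scale

odds_ratio = math.exp(coef)
or_ci = (math.exp(ci_low), math.exp(ci_high))

print(round(odds_ratio, 3))
print(tuple(round(bound, 3) for bound in or_ci))
```

The results agree with the table to about three decimal places; small differences come from the coefficient being rounded to four decimals in the summary.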

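# Calculate probability vs odds vs odds ratio from the frequency table
# Probability of admission in each group: admitted / group total
prob_rank1_accept = 87 / (125 + 87)
print(prob_rank1_accept)
prob_rank2_accept = 40 / (148 + 40)
print(prob_rank2_accept)
# Odds of admission in each group: admitted / not admitted
odds_rank1 = 87 / 125
odds_rank2 = 40 / 148
print(odds_rank1, odds_rank2)
# Odds ratio of group 2 (prestige 3-4) relative to group 1 (prestige 1-2)
odds_ratio = odds_rank2 / odds_rank1
print(odds_ratio)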

0.41037735849056606
0.2127659574468085
0.696 0.2702702702702703
0.38831935383659527
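
The odds ratio computed from the frequency table can also be given its own 95% confidence interval. The sketch below uses the standard log (Wald) method for a 2x2 table, where the standard error of log(OR) is the square root of the sum of the reciprocals of the four cell counts. This is an addition for illustration, not part of the original analysis, and the variable names are assumptions.

```python
import math

# Cell counts from the frequency table above.
admit_g1, reject_g1 = 87, 125   # dummy_rank 1 (prestige 1-2)
admit_g2, reject_g2 = 40, 148   # dummy_rank 2 (prestige 3-4)

# Odds ratio of group 2 relative to group 1.
crude_or = (admit_g2 / reject_g2) / (admit_g1 / reject_g1)

# Standard error of log(OR) for a 2x2 table.
se_log_or = math.sqrt(1 / admit_g1 + 1 / reject_g1 + 1 / admit_g2 + 1 / reject_g2)

z = 1.96  # normal quantile for a 95% interval
low = math.exp(math.log(crude_or) - z * se_log_or)
high = math.exp(math.log(crude_or) + z * se_log_or)

print(round(crude_or, 3), round(low, 3), round(high, 3))
```

The interval lies entirely below 1, which is consistent with lower odds of admission in the prestige 3-4 group.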

# Visualization
%matplotlib inline
pd.crosstab(df.admit, df.dummy_rank).plot(kind="bar")
plt.title("Admit vs Prestige")
plt.xlabel("Admit")
plt.ylabel("Student Frequency Count")

Summary

Our theoretical Odds Ratio is 0.320 with a 95% CI of (0.207, 0.494), which is close to the true Odds Ratio, 0.388; the true value also falls inside that interval. This indicates that for students from undergraduate schools of prestige 3 or 4, the odds of getting into graduate school are about 0.39 times the odds for students from prestige 1 or 2 undergraduate schools. In other words, the odds of admission are about 2.6 times as high for a student from an undergraduate school rated prestige 1 or 2 as for a student from a school rated 3 or 4. Our graph supports the result!
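
Because an odds ratio compares odds rather than probabilities, it is worth seeing how the two comparisons differ on this data. The sketch below recomputes both from the frequency table counts; the ratio of probabilities (often called the risk ratio) is an addition for illustration and the variable names are assumptions.

```python
# Counts from the frequency table: admitted and total per group.
admit_g1, total_g1 = 87, 125 + 87   # prestige 1-2
admit_g2, total_g2 = 40, 148 + 40   # prestige 3-4

p1 = admit_g1 / total_g1  # probability of admission, group 1
p2 = admit_g2 / total_g2  # probability of admission, group 2

# Ratio of probabilities (group 2 relative to group 1).
risk_ratio = p2 / p1

# Ratio of odds (group 2 relative to group 1).
odds_ratio = (p2 / (1 - p2)) / (p1 / (1 - p1))

# Flipping the comparison: group 1 relative to group 2.
inverse_odds_ratio = 1 / odds_ratio

print(round(risk_ratio, 3), round(odds_ratio, 3), round(inverse_odds_ratio, 3))
```

The probability of admission in the prestige 3-4 group is roughly half that of the prestige 1-2 group, while the odds are roughly 0.39 times as large, so the odds ratio overstates the gap if it is read as a ratio of probabilities.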

Inspired by http://blog.yhat.com/posts/logistic-regression-and-python.html

Happy Studying!