Odds Ratio

Data Science Day 12: Odds Ratio


Learning Objective: 

  • Probability vs. Odds vs. Odds Ratio

1. Probability = (number of outcomes in the event) / (number of outcomes in the sample space)
2. Odds = Prob(Event) / Prob(Non-Event)
3. Odds Ratio = Odds(Group 1) / Odds(Group 2)
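
The three definitions can be sketched in a few lines of Python. The helper names (`odds_from_prob`, `prob_from_odds`, `odds_ratio`) and the probabilities 0.8 and 0.5 are illustrative assumptions, not part of the original notebook:

```python
def odds_from_prob(p):
    """Convert a probability into odds: p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return p / (1 - p)


def prob_from_odds(odds):
    """Convert odds back into a probability: odds / (1 + odds)."""
    return odds / (1 + odds)


def odds_ratio(p_group1, p_group2):
    """Odds ratio of group 1 relative to group 2, given each group's probability."""
    return odds_from_prob(p_group1) / odds_from_prob(p_group2)


# Illustrative values only: a probability of 0.8 means odds of about 4 to 1,
# while a probability of 0.5 means even odds (1 to 1).
print(odds_from_prob(0.8))
print(odds_ratio(0.8, 0.5))
```

Note that probabilities stay between 0 and 1, while odds can be any non-negative number.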

  • Interpretation

The Odds Ratio (OR) is a measure of association between an exposure and an outcome.

OR = Odds(Group 1)/Odds(Group 2) > 1 indicates that the event has higher odds of occurring in Group 1 than in Group 2.

OR = Odds(Group 1)/Odds(Group 2) < 1 indicates that the event has lower odds of occurring in Group 1 than in Group 2.

OR = 1 indicates no association between group and event.

The 95% confidence interval gives a range of plausible values for the true Odds Ratio, and the p-value indicates whether the association is statistically significant.
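
The reading rules above can be written down as a small helper. This is only a sketch: the function name `interpret_odds_ratio` and the example numbers are hypothetical. It relies on the usual rule of thumb that a 95% confidence interval which excludes 1 roughly corresponds to significance at the 5% level:

```python
def interpret_odds_ratio(or_value, ci_low, ci_high):
    """Return the direction of an odds ratio and whether its 95% CI excludes 1."""
    if or_value > 1:
        direction = "higher odds in Group 1"
    elif or_value < 1:
        direction = "lower odds in Group 1"
    else:
        direction = "no difference in odds"
    # If the whole interval sits on one side of 1, "no association" is implausible.
    ci_excludes_one = ci_low > 1 or ci_high < 1
    return direction, ci_excludes_one


# Illustrative numbers only
print(interpret_odds_ratio(2.0, 1.2, 3.3))
print(interpret_odds_ratio(0.8, 0.5, 1.3))
```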

  • Example: UCLA Graduate School Admission dataset

  1. Calculate both the theoretical (model-based) and the true (observed) Odds Ratio, and interpret what the Odds Ratio means.

    Code gist: https://gist.github.com/fangya18/2cdf0ae21856edbaca0c1d3d0aefd501

   admit  gre   gpa  prestige
0      0  380  3.61         3
1      1  660  3.67         3
2      1  800  4.00         1
3      1  640  3.19         4
4      0  520  2.93         4

    # Assumes df is already loaded and numpy, pandas, and pylab are imported as np, pd, and pl.
    # Prestige 1 is the most prestigious tier of schools.
    # Build dummy_rank: prestige 1 and 2 become group 1; prestige 3 and 4 become group 2.
    df["dummy_rank"] = np.where(df["prestige"] < 3, 1, 2)
    df.hist()
    pl.show()
    # dummy_rank = pd.get_dummies(df["prestige"], prefix="prestige")
    print(df.head())

    # Frequency table: admit vs. dummy_rank
    print(pd.crosstab(df["admit"], df["dummy_rank"]))

 

   admit  gre   gpa  prestige  dummy_rank
0      0  380  3.61         3           2
1      1  660  3.67         3           2
2      1  800  4.00         1           1
3      1  640  3.19         4           2
4      0  520  2.93         4           2
dummy_rank    1    2
admit               
0           125  148
1            87   40

 

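    # Apply logistic regression (assumes statsmodels.api is imported as sm).
    # Note: sm.Logit does not add an intercept automatically, so this model is fit without one.
    X = df[["gre", "gpa", "dummy_rank"]]

    logit = sm.Logit(df["admit"], X)
    result = logit.fit()
    print(result.summary())
    print(result.conf_int())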

Optimization terminated successfully.
         Current function value: 0.593637
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  admit   No. Observations:                  400
Model:                          Logit   Df Residuals:                      397
Method:                           MLE   Df Model:                            2
Date:                Fri, 19 Oct 2018   Pseudo R-squ.:                 0.05014
Time:                        17:44:14   Log-Likelihood:                -237.45
converged:                       True   LL-Null:                       -249.99
                                        LLR p-value:                 3.604e-06
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
gre            0.0014      0.001      1.318      0.188      -0.001       0.003
gpa            0.0247      0.204      0.121      0.904      -0.375       0.425
dummy_rank    -1.1395      0.222     -5.144      0.000      -1.574      -0.705
==============================================================================
                   0         1
gre        -0.000660  0.003368
gpa        -0.375392  0.424737
dummy_rank -1.573685 -0.705355
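
To see how the columns of the summary table relate to each other, the following sketch recomputes the z statistic and the 95% interval for `dummy_rank` by hand. It uses the rounded coefficient and standard error printed above and assumes the usual normal-approximation multiplier of 1.96, so the results match the table only up to rounding:

```python
import math

# Values copied (rounded) from the dummy_rank row of the summary table
coef = -1.1395
std_err = 0.222

z = coef / std_err                  # Wald z statistic
ci_low = coef - 1.96 * std_err      # lower 95% bound on the log-odds scale
ci_high = coef + 1.96 * std_err     # upper 95% bound on the log-odds scale

print(z)
print(ci_low, ci_high)

# Exponentiating moves from the log-odds scale to the odds-ratio scale
print(math.exp(coef), math.exp(ci_low), math.exp(ci_high))
```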

    # Theoretical odds ratio: exponentiate the fitted coefficients
    print(np.exp(result.params))

    # Odds ratios together with their 95% confidence intervals
    params = result.params
    conf = result.conf_int()
    conf["OR"] = params
    conf.columns = ["2.5%", "97.5%", "OR"]
    print(np.exp(conf))

gre           1.001355
gpa           1.024980
dummy_rank    0.319973
dtype: float64
               2.5%     97.5%        OR
gre         0.99934  1.003374  1.001355
gpa         0.68702  1.529189  1.024980
dummy_rank  0.20728  0.493933  0.319973
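
One way to read the `dummy_rank` odds ratio: holding GRE and GPA fixed, moving from group 1 to group 2 multiplies the predicted odds of admission by exp(coefficient). The sketch below checks this with the rounded coefficients from the summary table. The applicant (GRE 600, GPA 3.5) and the function names are hypothetical, chosen only for illustration:

```python
import math

# Coefficients copied (rounded) from the regression summary; the model has no intercept
COEF = {"gre": 0.0014, "gpa": 0.0247, "dummy_rank": -1.1395}


def predicted_odds(gre, gpa, dummy_rank):
    """Odds of admission implied by the fitted coefficients."""
    log_odds = (COEF["gre"] * gre
                + COEF["gpa"] * gpa
                + COEF["dummy_rank"] * dummy_rank)
    return math.exp(log_odds)


def predicted_prob(gre, gpa, dummy_rank):
    """Probability of admission: odds / (1 + odds)."""
    odds = predicted_odds(gre, gpa, dummy_rank)
    return odds / (1 + odds)


# Hypothetical applicant: same GRE and GPA, different prestige group
odds_group1 = predicted_odds(600, 3.5, 1)
odds_group2 = predicted_odds(600, 3.5, 2)
ratio = odds_group2 / odds_group1

# The ratio equals exp(coefficient), whatever GRE and GPA values are used
print(ratio, math.exp(COEF["dummy_rank"]))
print(predicted_prob(600, 3.5, 1), predicted_prob(600, 3.5, 2))
```
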
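
    # Calculate probability vs. odds vs. odds ratio from the frequency table
    # Probability of admission in each dummy_rank group
    prob_rank1_accept = 87 / (125 + 87)
    print(prob_rank1_accept)

    prob_rank2_accept = 40 / (148 + 40)
    print(prob_rank2_accept)

    # Odds of admission = admitted / not admitted
    odds_rank1 = 87 / 125
    odds_rank2 = 40 / 148
    print(odds_rank1, odds_rank2)

    # Odds ratio of group 2 relative to group 1
    odds_ratio = odds_rank2 / odds_rank1
    print(odds_ratio)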

0.41037735849056606
0.2127659574468085
0.696 0.2702702702702703
0.38831935383659527
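
The observed odds ratio can be given its own approximate 95% confidence interval directly from the four cell counts. The sketch below uses the standard log-based (Woolf) approximation, which is an addition here and not something the original notebook computes. The variable names are illustrative:

```python
import math

# Cell counts from the admit vs. dummy_rank frequency table
admit_g1, reject_g1 = 87, 125   # dummy_rank 1 (prestige 1 or 2)
admit_g2, reject_g2 = 40, 148   # dummy_rank 2 (prestige 3 or 4)

# Observed odds ratio of group 2 relative to group 1
odds_ratio = (admit_g2 / reject_g2) / (admit_g1 / reject_g1)

# Approximate standard error of log(OR): square root of the summed reciprocal counts
se_log_or = math.sqrt(1 / admit_g1 + 1 / reject_g1 + 1 / admit_g2 + 1 / reject_g2)

# Build the interval on the log scale, then exponentiate back
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(odds_ratio, ci_low, ci_high)
```

If the resulting interval lies entirely below 1, the lower odds for group 2 are unlikely to be due to chance alone.
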

    # Visualization (assumes matplotlib.pyplot is imported as plt)
    # %matplotlib inline is a Jupyter magic; omit it outside a notebook.
    %matplotlib inline
    pd.crosstab(df.admit, df.dummy_rank).plot(kind="bar")
    plt.title("Admit vs Prestige")
    plt.xlabel("Admit")
    plt.ylabel("Student Frequency Count")

Summary

Our theoretical Odds Ratio is 0.320 with a 95% CI of (0.207, 0.494), which is close to the true Odds Ratio of 0.388. This indicates that for students from undergraduate schools in prestige 3 or 4, the odds of getting into graduate school are about 39% of the odds for students from prestige 1 or 2 schools. In other words, the odds of admission for a student from an undergraduate school rated prestige 1 or 2 are about 2.6 times the odds for a student from a school rated 3 or 4. Our graph supports the result!
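
A ratio of odds is not the same thing as a ratio of probabilities, which matters when phrasing the conclusion. As a rough illustration using only the counts from the frequency table, the sketch below flips the observed odds ratio so that prestige 1 or 2 is compared against prestige 3 or 4, and sets it next to the plain ratio of admission probabilities (the variable names are illustrative):

```python
# Counts from the admit vs. dummy_rank frequency table
p_group1 = 87 / (125 + 87)    # admission probability, prestige 1 or 2
p_group2 = 40 / (148 + 40)    # admission probability, prestige 3 or 4
odds_group1 = 87 / 125
odds_group2 = 40 / 148

# Odds ratio of group 1 relative to group 2 (the reciprocal of 0.388)
or_1_vs_2 = odds_group1 / odds_group2

# Ratio of admission probabilities, shown for comparison
prob_ratio = p_group1 / p_group2

print(or_1_vs_2, prob_ratio)
```

The two numbers differ, so a statement such as "times more likely" should say whether it refers to odds or to probabilities.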

Inspired by http://blog.yhat.com/posts/logistic-regression-and-python.html

 

Happy Studying! 

