I. Environment
Windows 10
Python 3.6
Jupyter Notebook
Graphviz (for an introduction and installation instructions, see https://www.jianshu.com/p/b559dc689b7f)
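II. Data source
http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html
Copy the data on this page into a CSV file and name it dataset_uncleaned.csv. The code below reads it from the Scraped-Data/ directory.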
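III. Cleaning the data
1 Put each disease and its symptoms into a dictionary, with the disease as the key and the list of symptoms as the value.
Note that some disease and symptom cells contain the special character '^'. Replace it with '_' before splitting.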
import csv
from collections import defaultdict

def return_list(disease):
    # After replacing '^' with '_' and splitting on '_',
    # every second piece (2nd, 4th, ...) is a name.
    disease_list = []
    match = disease.replace('^', '_').split('_')
    ctr = 1
    for group in match:
        if ctr % 2 == 0:
            disease_list.append(group)
        ctr = ctr + 1
    return disease_list

with open("Scraped-Data/dataset_uncleaned.csv") as csvfile:
    reader = csv.reader(csvfile)
    disease = ""
    weight = 0
    disease_list = []
    dict_wt = {}                # disease -> sample count
    dict_ = defaultdict(list)   # disease -> list of symptoms
    for row in reader:
        # A new disease starts on a row whose first cell is not empty.
        # "\xc2\xa0" is a non-breaking space as its UTF-8 bytes look when decoded as Latin-1.
        if row[0] != "\xc2\xa0" and row[0] != "":
            disease = row[0]
            disease_list = return_list(disease)
            weight = row[1]

        if row[2] != "\xc2\xa0" and row[2] != "":
            symptom_list = return_list(row[2])
            for d in disease_list:
                for s in symptom_list:
                    dict_[d].append(s)
                dict_wt[d] = weight

    print(dict_)
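To make the replace-and-split step concrete, here is a minimal sketch on a single made-up cell. The sample string and the helper name names_from_cell are assumptions for illustration only. What it takes from the code above is the rule that, after replacing '^' with '_' and splitting on '_', every second piece is a name.

```python
def names_from_cell(cell):
    """Return the names in a cell shaped like 'CODE_name^CODE_name'."""
    pieces = cell.replace('^', '_').split('_')
    # pieces alternate code, name, code, name, ...; keep the names
    return pieces[1::2]

# Hypothetical cell with two entries joined by '^'
sample = "UMLS:C0000001_disease a^UMLS:C0000002_disease b"
print(names_from_cell(sample))
```

This rule relies on names being free of underscores; a name that itself contains '_' would be split in the wrong place.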
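2 Write disease, symptom, and sample count to dataset_clean.csv. Each disease has one sample count and several symptoms, so it produces one row per symptom.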
with open("Scraped-Data/dataset_clean.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    for key, values in dict_.items():
        for v in values:
            # One row per (disease, symptom) pair, with the disease's sample count
            writer.writerow([key, v, dict_wt[key]])
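Note: on Windows, the CSV written this way has a blank line after every row, because the file was opened without newline=''. There is no need to fix this now; the next step removes the blank lines.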
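The blank rows come from writing CSV on Windows without newline=''. As an alternative to cleaning them up afterwards, they can be avoided at write time. The following is a minimal sketch, not the original author's code: the rows and the temporary file path are made up for the demonstration.

```python
import csv
import os
import tempfile

# Made-up rows in the disease, symptom, count layout
rows = [["disease a", "symptom x", "10"],
        ["disease a", "symptom y", "10"]]

path = os.path.join(tempfile.mkdtemp(), "demo_clean.csv")

# newline='' lets the csv module control line endings itself,
# so no extra blank line appears between rows on Windows.
with open(path, "w", newline="") as csvfile:
    csv.writer(csvfile).writerows(rows)

with open(path, newline="") as csvfile:
    read_back = list(csv.reader(csvfile))

print(read_back)
```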
3 Add column headers to dataset_clean.csv

import pandas as pd

columns = ['Source', 'Target', 'Weight']
data = pd.read_csv("Scraped-Data/dataset_clean.csv", names=columns, encoding="ISO-8859-1")
data.head()
data.to_csv("Scraped-Data/dataset_clean.csv", index=False)
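The blank line after each row is now gone: pd.read_csv skips blank lines by default (skip_blank_lines=True), and the file is written back without them.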
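A small sketch of why the blank lines disappear, using made-up CSV text held in memory instead of the real file. It assumes only pandas' default behavior of skipping blank lines when reading.

```python
import io
import pandas as pd

# Made-up CSV text with a blank line after every row and no header
raw = "disease a,symptom x,10\n\ndisease a,symptom y,10\n\n"

columns = ['Source', 'Target', 'Weight']
# skip_blank_lines=True is the default, so the empty lines are dropped
demo = pd.read_csv(io.StringIO(raw), names=columns)

print(len(demo))
print(demo.to_csv(index=False))
```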
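4 Label the nodes and save them to nodetable.csv
The table has three columns. The first, Id, is the disease or symptom name. The second, Label, is identical to Id. The third, Attribute, states whether the entry is a disease or a symptom.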
slist = []  # symptoms already written
dlist = []  # diseases already written
with open("Scraped-Data/nodetable.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    for key, values in dict_.items():
        for v in values:
            if v not in slist:
                writer.writerow([v, v, "symptom"])
                slist.append(v)
        if key not in dlist:
            writer.writerow([key, key, "disease"])
            dlist.append(key)

nt_columns = ['Id', 'Label', 'Attribute']
nt_data = pd.read_csv("Scraped-Data/nodetable.csv", names=nt_columns, encoding="ISO-8859-1")
nt_data.head()
nt_data.to_csv("Scraped-Data/nodetable.csv", index=False)
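The same node table can be built with a set for the "already written" check, which avoids scanning a list for every row. This is only a sketch under assumptions: node_rows is a hypothetical helper, the dictionary is made up, and the output goes to an in-memory buffer with the header written directly rather than added later with pandas.

```python
import csv
import io

def node_rows(disease_to_symptoms):
    """Yield one [Id, Label, Attribute] row per distinct node."""
    seen = set()  # (name, kind) pairs already emitted
    for disease, symptoms in disease_to_symptoms.items():
        candidates = [(s, "symptom") for s in symptoms] + [(disease, "disease")]
        for name, kind in candidates:
            if (name, kind) not in seen:
                seen.add((name, kind))
                yield [name, name, kind]

# Made-up input: "symptom x" is shared by both diseases
demo = {"disease a": ["symptom x", "symptom y"],
        "disease b": ["symptom x"]}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Id", "Label", "Attribute"])
writer.writerows(node_rows(demo))
print(buffer.getvalue())
```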
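IV. Analyzing the cleaned data

data = pd.read_csv("Scraped-Data/dataset_clean.csv", encoding="ISO-8859-1")
print(len(data['Source'].unique()))  # number of distinct diseases
print(len(data['Target'].unique()))  # number of distinct symptoms

df = pd.DataFrame(data)
df_1 = pd.get_dummies(df.Target)  # one 0/1 column per symptom
df_s = df['Source']
df_pivoted = pd.concat([df_s, df_1], axis=1)
df_pivoted.drop_duplicates(keep='first', inplace=True)
print(len(df_pivoted))

# Symptom columns only: skip the first column, 'Source', which holds the label.
# Without this, the label column would end up among the features in section V.
cols = df_pivoted.columns[1:]
print(cols)

# Collapse to one row per disease
df_pivoted = df_pivoted.groupby('Source').sum()
df_pivoted = df_pivoted.reset_index()
print(len(df_pivoted))
df_pivoted.to_csv("Scraped-Data/df_pivoted.csv")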
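This code mainly explores the data, for example how many distinct diseases and how many distinct symptoms there are. It then marks each symptom that belongs to a disease with 1 and every other symptom with 0, and saves the combined table to df_pivoted.csv.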
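The pivot is easier to follow on a tiny example. The sketch below uses a made-up three-row edge list in the same Source/Target layout; the names are illustrative, and the .astype(int) call is added here only so the result shows 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Made-up edge list: one row per (disease, symptom) pair
edges = pd.DataFrame({
    "Source": ["disease a", "disease a", "disease b"],
    "Target": ["symptom x", "symptom y", "symptom x"],
})

# One column per symptom, 1 where the row mentions that symptom
one_hot = pd.get_dummies(edges["Target"]).astype(int)

# Attach the disease name, then collapse to one row per disease
pivot = (pd.concat([edges["Source"], one_hot], axis=1)
           .groupby("Source").sum()
           .reset_index())
print(pivot)
```

Each row of the result is one disease, and each symptom column holds 1 if the disease has that symptom and 0 otherwise.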
V. Training a model with naive Bayes

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.naive_bayes import MultinomialNB
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

x = df_pivoted[cols]      # symptom columns
y = df_pivoted['Source']  # disease names

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
mnb = MultinomialNB()
mnb = mnb.fit(x_train, y_train)
mnb.score(x_test, y_test)
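The score is 0, which means the model has no predictive power on the test set.
The reason is that the table has 149 rows, one per disease, so each disease appears exactly once. The third of the diseases held out for testing were never seen during training, and the model cannot predict a label it has never seen.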
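The effect can be reproduced on a toy dataset. This sketch uses six made-up "diseases", each with its own single symptom and each appearing once; it is an illustration of the argument above, not a rerun of the article's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up data: 6 labels, each appearing exactly once
x = np.eye(6, dtype=int)  # one distinct symptom per label
y = np.array(["d0", "d1", "d2", "d3", "d4", "d5"])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42)

mnb = MultinomialNB().fit(x_train, y_train)

# The model can only output labels listed in classes_,
# and none of the test labels are among them.
print(set(mnb.classes_) & set(y_test))
print(mnb.score(x_test, y_test))
```

Because every label is unique, any train/test split puts each test label outside the training set, so the score cannot be anything other than 0.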
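Instead, train on all of the data and predict on all of the data. Note that this measures how well the model fits its training data, not how well it generalizes.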
mnb_tot = MultinomialNB()
mnb_tot = mnb_tot.fit(x, y)
mnb_tot.score(x, y)
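The score is 0.8993288590604027, which corresponds to 134 of the 149 diseases being predicted correctly.
Print the diseases that were predicted incorrectly: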
disease_pred = mnb_tot.predict(x)
disease_real = y.values
for i in range(0, len(disease_real)):
    if disease_pred[i] != disease_real[i]:
        print('Pred: {0} Actual:{1}'.format(disease_pred[i].ljust(30), disease_real[i]))
Output:
Pred: HIV                            Actual:acquired immuno-deficiency syndrome
Pred: biliary calculus               Actual:cholelithiasis
Pred: coronary arteriosclerosis      Actual:coronary heart disease
Pred: depression mental              Actual:depressive disorder
Pred: HIV                            Actual:hiv infections
Pred: carcinoma breast               Actual:malignant neoplasm of breast
Pred: carcinoma of lung              Actual:malignant neoplasm of lung
Pred: carcinoma prostate             Actual:malignant neoplasm of prostate
Pred: carcinoma colon                Actual:malignant tumor of colon
Pred: candidiasis                    Actual:oralcandidiasis
Pred: effusion pericardial           Actual:pericardial effusion body substance
Pred: malignant neoplasms            Actual:primary malignant neoplasm
Pred: sepsis (invertebrate)          Actual:septicemia
Pred: sepsis (invertebrate)          Actual:systemic infection
Pred: tonic-clonic epilepsy          Actual:tonic-clonic seizures
VI. Training a model with a decision tree

from sklearn.tree import DecisionTreeClassifier, export_graphviz

dt = DecisionTreeClassifier()
clf_dt = dt.fit(x, y)
print("Accuracy: ", clf_dt.score(x, y))
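The score is 0.8993288590604027, the same as with naive Bayes above. A likely explanation (not verified against the raw data here) is that the misclassified names are synonyms that were joined by '^' in the source and therefore received identical symptom lists during cleaning. No classifier can tell such rows apart.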
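If some diseases do share exactly the same symptom vector (an assumption, not checked against the dataset here), there is a ceiling on training accuracy that applies to any deterministic classifier. The sketch below computes that ceiling for made-up rows; max_training_accuracy is a hypothetical helper and assumes one row per label, as in df_pivoted.

```python
from collections import defaultdict

def max_training_accuracy(rows):
    """rows: list of (label, feature_tuple), one row per label.

    Rows with the same feature_tuple are indistinguishable, so a
    deterministic classifier gets at most one of them right.
    """
    groups = defaultdict(list)
    for label, features in rows:
        groups[features].append(label)
    return len(groups) / len(rows)

# Made-up example: two names share one symptom vector
rows = [("name 1", (1, 0, 1)),
        ("name 2", (1, 0, 1)),  # same symptoms as "name 1"
        ("name 3", (0, 1, 0))]
print(max_training_accuracy(rows))
```

Applying the same grouping to the rows of df_pivoted would show whether duplicated symptom vectors account for the identical scores of the two models.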
Next, visualize the nodes of the decision tree.
1 Generate tree.dot
from sklearn.tree import export_graphviz

# The DOT-files/ directory must already exist
export_graphviz(dt, out_file='DOT-files/tree.dot', feature_names=cols)
The file tree.dot now appears in the DOT-files directory under the project directory.
Open a cmd terminal, change to the directory that contains tree.dot (DOT-files/), and run:
dot -Tpng tree.dot -o ..\tree.png
This produces tree.png in the parent directory.
If tree.dot is too large, however, dot may fail with an out-of-memory error:
dot: failure to create cairo surface: out of memory
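One possible workaround, offered as a suggestion and not taken from the original project, is to export only the top levels of the tree so that the DOT file stays small. export_graphviz accepts a max_depth argument for this. The sketch uses the iris dataset as a stand-in; in this article the fitted tree would be dt and the feature names would be cols.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Stand-in data and model for the demonstration
iris = load_iris()
demo_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# max_depth limits how many levels are written to the DOT text;
# the fitted model itself is not changed or retrained.
# out_file=None returns the DOT text as a string instead of writing a file.
dot_text = export_graphviz(demo_tree, out_file=None,
                           feature_names=iris.feature_names,
                           max_depth=2)
print(dot_text[:60])
```

The truncated picture shows only the first splits, which is often enough to see which symptoms the tree tests first.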
2 Display tree.png in Jupyter Notebook

from IPython.display import Image

Image(filename='tree.png')
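VII. Copyright notice
The program comes from https://github.com/Aniruddha-Tapas/Predicting-Diseases-From-Symptoms
I am only studying, analyzing, and documenting it here; the copyright belongs to the original author.
Author: 海天一树X
Link: https://www.jianshu.com/p/882ee4db4e40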