慕标琳琳
I realize one approach is to build a decision tree with max_depth=1. This performs a single split into two leaves. Then pick the leaf with the highest impurity, fit the decision tree to that subset again to split it further, and repeat. To make the hierarchy clearly visible, I relabel the leaf_ids so that ID values decrease as you move up the tree. Here is an example:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def decision_tree_one_path(X, y=None, min_leaf_size=3):
    nobs = X.shape[0]
    # boolean vector marking the observations in the newest split
    include = np.ones((nobs,), dtype=bool)
    # try to get leaves around min_leaf_size
    min_leaf_size = max(min_leaf_size, 1)
    # one-level DT splitter
    dtmodel = DecisionTreeClassifier(
        splitter="best",
        criterion="gini",
        max_depth=1,
        min_samples_split=int(np.round(2.05 * min_leaf_size)),
    )
    leaf_id = np.ones((nobs,), dtype="int64")
    iter = 0
    if y is None:
        y = np.random.binomial(n=1, p=0.5, size=nobs)
    while nobs >= 2 * min_leaf_size:
        dtmodel.fit(X=X.loc[include], y=y[include])
        # give each observation its node id (1 or 2 for a depth-1 tree)
        new_leaf_names = dtmodel.apply(X=X.loc[include])
        impurities = dtmodel.tree_.impurity[1:]
        if len(impurities) == 0:
            # was not able to split while maintaining the constraint
            break
        # make sure the node that is NOT split gets the lower label 1
        most_impure_node = np.argmax(impurities)
        if most_impure_node == 0:  # i.e., label 1
            # swap the 1 and 2 labels
            is_label_2 = new_leaf_names == 2
            new_leaf_names[is_label_2] = 1
            new_leaf_names[np.logical_not(is_label_2)] = 2
        # rename leaves
        leaf_id[include] = iter + new_leaf_names
        will_be_split = new_leaf_names == 2
        # drop the terminal leaf from the active set
        tmp = np.ones((nobs,), dtype=bool)
        tmp[np.logical_not(will_be_split)] = False
        include[include] = tmp
        # continue with the remaining observations
        nobs = np.sum(will_be_split)
        iter = iter + 1
    return leaf_id
```

leaf_id is thus the leaf ID of each observation, in order. So, for example, leaf_id == 1 marks the observations that were first split off into a terminal node. leaf_id == 2 is the next terminal node, split off from the node that produced leaf_id == 1, as shown below. There are therefore k+1 leaves.

```
0
|\
1 .
  |\
  2 .
  .......
      |\
      k (k+1)
```

However, I was wondering whether there is a way to do this automatically in Python.
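For reference, the building block the loop above relies on can be sketched in isolation: fit a depth-1 stump, read the per-leaf Gini impurities from `tree_.impurity` (index 0 is the root, indices 1 and 2 are the two leaves), and find the more impure leaf to split next. The toy data here is an assumption purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# toy data (assumed): two features, label driven by the sign of x1
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=40), "x2": rng.normal(size=40)})
y = (X["x1"] > 0).astype(int).to_numpy()

# one-level split, as in the loop above
stump = DecisionTreeClassifier(max_depth=1, criterion="gini")
stump.fit(X, y)

# tree_.impurity[0] is the root; 1 and 2 are the two leaves
leaf_impurities = stump.tree_.impurity[1:]
most_impure = np.argmax(leaf_impurities)  # leaf to keep splitting
leaf_of_each_obs = stump.apply(X)         # node id (1 or 2) per observation
print(leaf_impurities, most_impure)
```

Everything else in the function is bookkeeping around this step: relabeling the two node ids and shrinking the boolean `include` mask to the impure side.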