作为 PC 算法的一部分，在 python 中测试条件独立性

作为 PC 算法的一部分，在 python 中测试条件独立性

我正在python中实现PC算法。这种算法构建了一个 n 变量高斯分布的图形模型。这个图形模型基本上是有向无环图的骨架，这意味着如果一个结构像：

(x1)---(x2)---(x3)

在图中，则 x1 与给定 x2 的 x3 无关。更一般地，如果 A 是图的邻接矩阵且 A(i,j)=A(j,i) = 0（i 和 j 之间缺少边），则 i 和 j 条件独立，所有变量出现在从 i 到 j 的任何路径中。出于统计和机器学习的目的，可以“学习”底层图形模型。如果我们对联合高斯 n 变量随机变量有足够的观察，我们可以使用 PC 算法，其工作原理如下：

given n as the number of variables observed, initialize the graph as G=K(n)

for each pair i,j of nodes:

if exists an edge e from i to j:

look for the neighbours of i

if j is in neighbours of i then remove j from the set of neighbours

call the set of neighbours k

TEST if i and j are independent given the set k, if TRUE:

remove the edge e from i to j

该算法还计算图的分离集，该分离集由另一种算法使用，该算法构建从骨架开始的 dag 和由 pc 算法返回的分离集。这是我到目前为止所做的：

def _core_pc_algorithm(a,sigma_inverse):

l = 0

N = len(sigma_inverse[0])

n = range(N)

sep_set = [ [set() for i in n] for j in n]

act_g = complete(N)

z = lambda m,i,j : -m[i][j]/((m[i][i]*m[j][j])**0.5)

while l<N:

for (i,j) in itertools.permutations(n,2):

adjacents_of_i = adj(i,act_g)

if j not in adjacents_of_i:

continue

else:

adjacents_of_i.remove(j)

if len(adjacents_of_i) >=l:

for k in itertools.combinations(adjacents_of_i,l):

if N-len(k)-3 < 0:

return (act_g,sep_set)

if test(sigma_inverse,z,i,j,l,a,k):

act_g[i][j] = 0

act_g[j][i] = 0

sep_set[i][j] |= set(k)

sep_set[j][i] |= set(k)

l = l + 1

此函数实现了一个统计测试，该测试使用了估计偏相关的 Fisher z 变换。我以两种方式使用这个算法：

从线性回归模型生成数据并将学习到的 DAG 与预期的进行比较
读取数据集并学习底层 DAG

在这两种情况下，我并不总是得到正确的结果，要么是因为我知道某个数据集背后的 DAG，要么是因为我知道生成模型但它与我的算法学习的模型不一致。我完全知道这是一项重要的任务，即使在我在这里省略的部分代码中，我也可能误解了理论概念以及犯了错误；但首先我想知道（从比我更有经验的人那里），如果我写的测试是正确的，并且是否有执行此类测试的库函数，我尝试搜索但我找不到任何合适的功能。

蛊毒传说

浏览 312回答 1

1回答

交互式爱情

我切入正题。上面代码中最关键的问题，关于以下错误：sqrt(n-len(k)-3)*abs(z(sigma_inverse[i][j])) <= phi(1-alpha/2)我误解了 n 的平均值，它不是精度矩阵的大小，而是多变量观察的总数（在我的情况下，是 10000 而不是 5）。另一个错误的假设是 z(sigma_inverse[i][j]) 必须提供 i 和 j 的部分相关性，给定所有其余部分。这是不正确的，z 是精度矩阵的适当子集上的 Fisher 变换，它估计给定 K 时 i 和 j 的偏相关。正确的测试如下：if len(K) == 0: #CM is the correlation matrix, we have no variables conditioning (K has 0 length)    r = CM[i, j] #r is the partial correlation of i and j elif len(K) == 1: #we have one variable conditioning, not very different from the previous version except for the fact that i have not to compute the correlations matrix since i start from it, and pandas provide such a feature on a DataFrame    r = (CM[i, j] - CM[i, K] * CM[j, K]) / math.sqrt((1 - math.pow(CM[j, K], 2)) * (1 - math.pow(CM[i, K], 2))) #r is the partial correlation of i and j given Kelse: #more than one conditioning variable    CM_SUBSET = CM[np.ix_([i]+[j]+K, [i]+[j]+K)] #subset of the correlation matrix i'm looking for    PM_SUBSET = np.linalg.pinv(CM_SUBSET) #constructing the precision matrix of the given subset    r = -1 * PM_SUBSET[0, 1] / math.sqrt(abs(PM_SUBSET[0, 0] * PM_SUBSET[1, 1]))r = min(0.999999, max(-0.999999,r)) res = math.sqrt(n - len(K) - 3) * 0.5 * math.log1p((2*r)/(1-r)) #estimating partial correlation with fisher's transofrmationreturn 2 * (1 - norm.cdf(abs(res))) #obtaining p-value我希望有人能发现这有帮助

0

0

随时随地看视频慕课网APP

相关分类

Python