pandas DataFrame with a variable-length MultiIndex replaces values with NaN

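I'm working with a pandas representation of a fairly complex dataset that comes from a survey. So far, a one-dimensional Series of variables with a MultiIndex seems to be the best fit for storing and processing this data.

Each variable name consists of a "path" that uniquely identifies that particular response. These paths vary in length. I'm trying to figure out whether I've misunderstood how hierarchical indexes are supposed to work, or whether I've hit a bug. When pandas joins a shorter index into the dataset, it appears to "pad" it to the maximum length, and the value is lost in the process.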

For example, this test fails:

    def test_dataframe_construction1(self):
        # Paths of different lengths: 3 levels and 5 levels
        case1 = pd.Series(True, pd.MultiIndex.from_tuples([
            ('a1', 'b1', 'c1'),
            ('a2', 'b2', 'c2', 'd1', 'e1'),
        ]))
        case2 = pd.Series(True, pd.MultiIndex.from_tuples([
            ('a3', 'b3', 'c3'),
            ('a4', 'b4', 'c4', 'd2', 'e2'),
        ]))
        df = pd.DataFrame({
            'case1': case1,
            'case2': case2
        })
        logger.debug(df)
        self.assertEqual(df['case1'].loc['a1'].any(), True)

It prints this:

                     case1 case2
    a1 b1 c1 nan nan   NaN   NaN
    a2 b2 c2 d1  e1   True   NaN
    a3 b3 c3 nan nan   NaN   NaN
    a4 b4 c4 d2  e2    NaN  True
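One way to look at the padding directly, without the DataFrame in the way, is to build the ragged index on its own and count the missing entries per row. This is only a minimal sketch: the names `ragged`, `n_levels`, and `missing_per_row` are made up for illustration, and it assumes `MultiIndex.from_tuples` accepts tuples of unequal length, as it does in the test above.

```python
import pandas as pd

# Same ragged paths as case1 in the failing test
ragged = pd.MultiIndex.from_tuples([
    ("a1", "b1", "c1"),
    ("a2", "b2", "c2", "d1", "e1"),
])

# Number of levels pandas settled on for the whole index
n_levels = ragged.nlevels

# Count the missing (padded) entries in each row of the index
missing_per_row = ragged.to_frame(index=False).isna().sum(axis=1).tolist()

print(n_levels)
print(missing_per_row)
```

If the shorter path really is padded with missing values, the first row should report two missing entries and the second none.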

Interestingly, padding the "shorter" index with empty strings instead of NaN gives the behavior I expect:

    def test_dataframe_construction2(self):
        # Shorter paths are padded with '' so every tuple has 5 levels
        case1 = pd.Series(True, pd.MultiIndex.from_tuples([
            ('a1', 'b1', 'c1', '', ''),
            ('a2', 'b2', 'c2', 'd1', 'e1'),
        ]))
        case2 = pd.Series(True, pd.MultiIndex.from_tuples([
            ('a3', 'b3', 'c3', '', ''),
            ('a4', 'b4', 'c4', 'd2', 'e2'),
        ]))
        df = pd.DataFrame({
            'case1': case1,
            'case2': case2
        })
        logger.debug(df)
        self.assertEqual(df['case1'].loc['a1'].any(), True)

It prints this:

                    case1 case2
    a1 b1 c1         True   NaN
    a2 b2 c2 d1 e1   True   NaN
    a3 b3 c3          NaN  True
    a4 b4 c4 d2 e2    NaN  True
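The empty-string padding could be applied automatically rather than typed into every tuple. The following is a minimal sketch under that assumption; `pad_paths` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd


def pad_paths(paths, fill=""):
    """Right-pad every path tuple with `fill` up to the longest path's length."""
    width = max(len(path) for path in paths)
    return [tuple(path) + (fill,) * (width - len(path)) for path in paths]


case1_paths = [("a1", "b1", "c1"), ("a2", "b2", "c2", "d1", "e1")]
case2_paths = [("a3", "b3", "c3"), ("a4", "b4", "c4", "d2", "e2")]

# Build each Series on an index whose tuples all have the same length
case1 = pd.Series(True, index=pd.MultiIndex.from_tuples(pad_paths(case1_paths)))
case2 = pd.Series(True, index=pd.MultiIndex.from_tuples(pad_paths(case2_paths)))

df = pd.DataFrame({"case1": case1, "case2": case2})

# Same check as the test: the value stored under 'a1' should still be there
a1_has_value = bool(df["case1"].loc["a1"].any())
print(a1_has_value)
```

Note that this helper pads each list to its own longest path. Both lists here happen to reach five levels; if they did not, the padding width would have to be computed across all of the Series before they are combined.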

What am I missing here? Thanks!

慕侠2389804
