I have this DataFrame:
   manufacturer  description
0  toyota        toyota, gmc 10 years old.
1  NaN           gmc, Motor runs and drives good.
2  NaN           Motor old, in pieces.
3  NaN           2 owner 0 rust. Cadillac.
I want to fill the NaN values with keywords taken from the description. To do that, I created a list of the keywords I am looking for:
keyword = ['gmc', 'toyota', 'cadillac']
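Finally, I want to loop over every row of the DataFrame, split the text in each row's 'description' column into words, and, if a word is also in the keyword list, write it to the 'manufacturer' column. For example, the result would look like this: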
   manufacturer  description
0  toyota        toyota, gmc 10 years old.
1  gmc           gmc, Motor runs and drives good.
2  NaN           Motor old, in pieces.
3  cadillac      2 owner 0 rust. Cadillac.
Thanks to a kind person in this community, I was able to improve my code to:
import re

keyword = ['gmc', 'toyota', 'cadillac']
bag_of_words = []

for i, description in enumerate(test3['description']):
    # Extract the alphabetic words (hyphens allowed) from this row's description
    bag_of_words = re.findall(r"""[A-Za-z\-]+""", test3["description"][i])
    for word in bag_of_words:
        # Compare case-insensitively against the keyword list
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()
But I noticed that the value in the first row was changed as well, even though it was not NaN:
   manufacturer  description
0  gmc           toyota, gmc 10 years old.
1  gmc           gmc, Motor runs and drives good.
2  NaN           Motor old, in pieces.
3  cadillac      2 owner 0 rust. Cadillac.
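I only want to change the NaN values, but when I tried adding a NaN check to the condition:
        if word.lower() in keyword and test3.loc[i, 'manufacturer'] == np.nan:
it had no effect: none of the rows were filled in.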
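One likely explanation is that NaN never compares equal to anything, including itself, so `== np.nan` is always False and the assignment is never reached. Below is a minimal sketch of one possible fix that tests for missing values with `pd.isna` instead. The frame is rebuilt from the sample above and the default integer index is assumed. The `break`, which keeps the first matching keyword rather than the last, is also an assumption about the desired behaviour.

```python
import re

import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (default RangeIndex assumed).
test3 = pd.DataFrame({
    "manufacturer": ["toyota", np.nan, np.nan, np.nan],
    "description": [
        "toyota, gmc 10 years old.",
        "gmc, Motor runs and drives good.",
        "Motor old, in pieces.",
        "2 owner 0 rust. Cadillac.",
    ],
})
keyword = ["gmc", "toyota", "cadillac"]

# NaN is not equal to itself, so an equality test cannot detect it.
print(np.nan == np.nan)  # False
print(pd.isna(np.nan))   # True

for i, description in enumerate(test3["description"]):
    # Leave rows that already have a manufacturer untouched.
    if not pd.isna(test3.loc[i, "manufacturer"]):
        continue
    for word in re.findall(r"[A-Za-z\-]+", description):
        if word.lower() in keyword:
            test3.loc[i, "manufacturer"] = word.lower()
            break  # assumption: keep the first matching keyword

print(test3)
```

With this check, row 0 keeps 'toyota', rows 1 and 3 are filled with 'gmc' and 'cadillac', and row 2 stays NaN because its description contains no keyword.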
智慧大石