Why does "\x01\x1A" (the Start of Heading and Substitute control characters) in a line of a text file make a for loop stop prematurely?


I am using Python 2.7.15 on Windows 7.

Context

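I wrote a script that reads and tokenizes each line of a FileZilla log file (see the specification) to get the IP address of every host that initiated a connection to the FileZilla server. I am having trouble parsing the log text field that follows the > character. The script I wrote uses the

    with open('fz.log', 'r') as rh:
        for lineno, line in enumerate(rh):
            pass

construct to read each line. The for loop stops prematurely when it reaches a log text field that contains the SOH and SUB characters. I cannot show you the log file because it contains sensitive information, but the crux of the problem can be reproduced by reading a text file that contains these characters.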

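My goal is to extract the IP addresses (which I can do with re.search()), but before that I have to remove these control characters. As a workaround, I created a copy of the log file with the lines containing these control characters removed. There may be a better way, but I am more curious about why the for loop stops once it encounters the control characters.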
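The clean-then-search step described above could be sketched roughly as follows. The helper name extract_ip, the loose IPv4 pattern, and the sample log line are all made up for illustration; the real FileZilla log layout may differ, and the pattern does not validate octet ranges.

```python
import re

# SOH (0x01) and SUB (0x1A), the two control characters named in the question.
CONTROL_CHARS = re.compile(br'[\x01\x1a]')
# Loose, hypothetical IPv4 pattern: four dot-separated groups of 1-3 digits.
IPV4 = re.compile(br'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def extract_ip(raw_line):
    """Return the first IPv4-looking token in raw_line (bytes), or None."""
    cleaned = CONTROL_CHARS.sub(b'', raw_line)
    match = IPV4.search(cleaned)
    return match.group(0) if match else None


# Invented sample line; not taken from a real log.
sample = b'(000003) (not logged in) (192.168.1.20)> Blah\x01\x1a\n'
print(extract_ip(sample))
```

Working on bytes keeps the sketch independent of how the log is encoded, and the same code runs on Python 2.7 and Python 3.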

Reproducing the problem

I reproduced the problem with this code:

    if __name__ == '__main__':
        fn = 'writetest.txt'
        fn2 = 'writetest_NoControlChars.txt'

        # Create the problematic text file
        with open(fn, 'w') as wh:
            wh.write("This line comes first!\n")
            wh.write("Blah\x01\x1A\n")  # Write the Start-of-Heading and Substitute control characters to the line
            wh.write("This comes after!")

        # Try to read the file above, removing the SOH/SUB characters if encountered
        with open(fn, 'r') as rh:
            with open(fn2, 'w') as wh:
                for lineno, line in enumerate(rh):
                    sline = line.translate(None, '\x01\x1A')
                    wh.write(sline)
                    print "Line #{}: {}".format(lineno, sline)

        print "Program executed."

Output

The code above creates 2 output files and prints the following to the console window:

Line #0: This line comes first!

Line #1: Blah

Program executed.

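I stepped through the code in the Eclipse debugger, and right after the

    for lineno, line in enumerate(rh):

statement executed for the second line, rh, the handle to the open file, was closed. I expected the loop to move on to the third line, print This comes after! to the console, and write it to writetest_NoControlChars.txt, but neither happened. Instead, execution jumped to print "Program executed.".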

慕侠2389804


www said:

If you know the file contains non-text data, you must open it in binary mode: open(fn, 'rb')
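The likely reason is that, on Windows, Python 2 hands text-mode reads to the C runtime, which treats byte 0x1A (Ctrl-Z) as an end-of-file marker; binary mode passes it through like any other byte. A minimal sketch of the binary-mode read, rebuilding the question's three-line test file in a temporary directory, might look like this. The helper name clean_lines is hypothetical, and stripping "\r\n" is an assumption about how the caller wants line endings handled, since binary mode no longer translates them.

```python
import os
import shutil
import tempfile

SOH_SUB = b'\x01\x1a'  # the two control characters to delete


def clean_lines(path):
    """Yield (lineno, line) with SOH/SUB removed and the line ending stripped."""
    with open(path, 'rb') as rh:  # binary mode: 0x1A is read as ordinary data
        for lineno, raw in enumerate(rh):
            yield lineno, raw.translate(None, SOH_SUB).rstrip(b'\r\n')


# Recreate the test file from the question, written in binary mode.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'writetest.txt')
with open(path, 'wb') as wh:
    wh.write(b'This line comes first!\n')
    wh.write(b'Blah\x01\x1a\n')
    wh.write(b'This comes after!')

lines = list(clean_lines(path))
print(len(lines))  # 3: the line after the control characters is read too

shutil.rmtree(tmp_dir)  # remove the temporary directory
```

The translate(None, ...) call deletes the listed bytes without needing a translation table, and it behaves the same on a Python 2 str and a Python 3 bytes object.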

