How to remove JavaScript and other tags with Python... without importing modules

For the first part of a school project, I am trying to figure out how to remove the JavaScript <script {...}> and </script {...}> tags, as well as anything between < and >.


However, we cannot import any modules (not even Python's built-in ones), because apparently the marker may not have access to them.


I tried this:


text = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"

while text.find("<script") >= 0:
    script_start = text.find("<script")
    script_end = text.find(">", text.find("</script")) + 1
    text = text[:script_start] + text[script_end:]

while text.find("<") >= 0:
    script2_start = text.find("<")
    script2_end = text.find(">") + 1
    text = text[:script2_start] + text[script2_end:]

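This works for small files, but the project deals with large text files (the simplified test file we were given is 10.4 MB), so the code never finishes and appears to hang.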


Does anyone have an idea how to make this more efficient?


神不在的星期二
3 Answers

大话西游666

You do not need to delete anything. In fact, you never want to modify strings. Strings are immutable: every time you "modify" a string, you create a new string and throw the old one away. That wastes processor time and memory.

You are operating on a file, so process it character by character:

- remember whether you are inside <...> or not
- if you are inside, the only character that matters is >, which takes you outside again
- if you are outside and the character is <, you go inside and ignore that character
- if you are outside and the character is not <, you write the character to the output (file)

# create file
with open("somefile.txt","w") as f:
    # raise the multiplier (for example to 140000) to create a file of roughly 10 MB
    f.write("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n"*10)

# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt","w") as out:
    # starting outside
    inside = False
    # we iterate the file line by line
    for line in f:
        # and each line characterwise
        for c in line:
            if not inside and c == "<":
                inside = True
            elif inside and c != ">":
                continue
            elif inside and c == ">":
                inside = False
            elif not inside:
                # only case to write to out
                out.write(c)

print(open("somefile.txt").read() + "\n")
print(open("otherfile.txt").read())

Output:

<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata

 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata
 hello  hello  hey  tata

If you are not allowed to operate on files directly, read the file into a list of characters. In CPython this costs noticeably more memory than the 11+ MB of the file itself, because the list stores one reference per character:

data = list("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)

result = []
inside = False
for c in data:
    if inside:
        if c == ">":
            inside = False
        # else ignore c - because we are inside
    elif c == "<":
        inside = True
    else:
        result.append(c)

print(''.join(result))

This is still better than repeatedly searching the list for the first occurrence of "<", but it can need up to twice the memory of the source list (if the text contains no <..> at all, the list is duplicated). Operating on files is far more memory-efficient than any in-place list modification (which would be a third approach).

You also need to deal with some obvious problems. For example,

<script type="text/javascript">
var i = 10;
if (i < 5) {
  // some code
}
</script>

would leave part of the "code" in the output, because the < inside the script is treated as the start of a tag.

The following may handle the simpler corner cases:

# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt","w") as out:
    # starting outside
    inside = False
    insideJS = False
    # we iterate the file line by line
    for line in f:
        # string manipulation :/ - will remove <script ...> .. </script ..>
        # even over multiple lines - probably still misses some corner cases.
        while True:
            if insideJS:
                jsStart = 0                      # script continues from a previous line
            elif "<script" in line:
                jsStart = line.index("<script")  # script opens in this line
                insideJS = True
            else:
                break                            # no script content left in this line
            jsClose = line.find("</script", jsStart)
            if jsClose < 0:
                line = line[:jsStart]            # script runs past the end of this line
                break
            # assumes the closing tag's ">" is on the same line as "</script"
            jsEnd = line.index(">", jsClose) + 1
            line = line[:jsStart] + line[jsEnd:]
            insideJS = False
        # and each line characterwise, same as above
        for c in line:
            if not inside and c == "<":
                inside = True
            elif inside and c != ">":
                continue
            elif inside and c == ">":
                inside = False
            elif not inside:
                out.write(c)
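When the whole text is already held in one string, the same "never rebuild the string" idea can also be sketched with `str.find` and a start offset: jump from tag to tag, collect the slices that lie outside of tags in a list, and join them once at the end. This is only an illustrative sketch, not part of the answer above. The function name `strip_tags` is made up, no real HTML parsing is done, and anything after an unclosed tag or unclosed script block is simply dropped.

```python
def strip_tags(text):
    """Remove <script ...>...</script ...> blocks and all other <...> tags."""
    kept = []   # slices of text that lie outside of any tag
    pos = 0     # everything before pos has already been handled
    while True:
        lt = text.find("<", pos)
        if lt < 0:
            kept.append(text[pos:])      # no more tags: keep the rest
            break
        kept.append(text[pos:lt])        # keep the text in front of the tag
        if text.startswith("<script", lt):
            # jump to the closing tag so '<' or '>' inside the code is ignored
            close = text.find("</script", lt)
            if close < 0:
                break                    # unclosed script block: drop the rest
            lt = close
        gt = text.find(">", lt)
        if gt < 0:
            break                        # unclosed tag: drop the rest
        pos = gt + 1
    return "".join(kept)


sample = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
print(repr(strip_tags(sample)))
```

Unlike the character loop, this version also discards the text between the opening and closing script tags, which is what the question asks for. Each `find` call continues from `pos` instead of from the start of the string, so the text is scanned only once.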

偶然的你

Even with two while loops, this still has linear complexity, because the index i only ever moves forward:

string = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
kept = []  # collect characters in a list and join once at the end
i = 0
while i < len(string):
    if string[i] == "<":
        # skip ahead to the closing '>' (or to the end if the tag is never closed)
        while i < len(string) and string[i] != ">":
            i += 1
    else:
        kept.append(string[i])
    i += 1
new_string = ''.join(kept)
print(new_string)

Output:

 hello  hello  hey 
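To make the linear-complexity claim concrete, the nested loops can be instrumented to count how often the index is advanced. This is a sketch for illustration only: the function name `strip_tags_counted` and the `steps` counter are made up, and the sketch adds a bounds check to the inner loop and collects the output in a list.

```python
def strip_tags_counted(text):
    """Strip <...> tags; also return how often the index i was advanced."""
    kept = []
    steps = 0
    i = 0
    while i < len(text):
        if text[i] == "<":
            # inner loop: move i forward to the closing '>' (or to the end)
            while i < len(text) and text[i] != ">":
                i += 1
                steps += 1
        else:
            kept.append(text[i])
        i += 1
        steps += 1
    return "".join(kept), steps


sample = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
result, steps = strip_tags_counted(sample)
print(repr(result), steps, len(sample))
```

Both loops share the same index, so the number of steps never exceeds the length of the text plus one, no matter how the tags are distributed.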

呼唤远方

Here is an approach using an FSA (finite-state automaton):

output = ''
NORMAL, INSIDE_TAG = range(2) # available states
state = NORMAL # start with normal state
s = '<script beep beep> hello </script boop doop woop> hello <hi id="someid" class="some class"><a> hey </a><bye>'
for char in s:
    if char == '<': # if we encounter '<' we enter the INSIDE_TAG state
        state = INSIDE_TAG
        continue
    elif char == '>': # we can safely exit the INSIDE_TAG state
        state = NORMAL
        continue
    if state == NORMAL:
        output += char  # add the char to the output only if we are in normal state
print(output)

If you need to parse tag semantics, make sure to use a stack (it can be implemented with a list). This adds complexity, but a finite-state machine lets you implement more reliable checks. See the following example:

output = ''
(
    NORMAL,
    TAG_ATTRIBUTE,
    INSIDE_JAVASCRIPT,
    EXITING_TAG,
    BEFORE_TAG_OPENING_OR_ENDING,
    TAG_NAME,
    ABOUT_TO_EXIT_JS
) = range(7) # available states
state = NORMAL # start with normal state
tag_name = ''
s = """<script type="text/javascript">
  var i = 10;
  if (i < 5) {
    // some code
  }
</script>
<sometag>
  test string
  <a href="http://google.com"> another string</a>
</sometag>"""
for char in s:
    # print(char, '-', state, ':', tag_name)
    if state == NORMAL:
        if char == '<':
            state = BEFORE_TAG_OPENING_OR_ENDING
        else:
            output += char
    elif state == BEFORE_TAG_OPENING_OR_ENDING:
        if char == '/':
            state = EXITING_TAG
        else:
            tag_name = char  # start a fresh tag name, so earlier tags cannot leak into it
            state = TAG_NAME
    elif state == TAG_ATTRIBUTE:
        if char == '>':
            if tag_name == 'script':
                state = INSIDE_JAVASCRIPT
            else:
                state = NORMAL
    elif state == TAG_NAME:
        if char == ' ':
            state = TAG_ATTRIBUTE
        elif char == '>':
            if tag_name == 'script':
                state = INSIDE_JAVASCRIPT
            else:
                state = NORMAL
        else:
            tag_name += char
    elif state == INSIDE_JAVASCRIPT:
        if char == '<':
            state = ABOUT_TO_EXIT_JS
        else:
            pass
            # output += char
    elif state == ABOUT_TO_EXIT_JS:
        if char == '/':
            state = EXITING_TAG
            tag_name = ''
        else:
            # output += '<'
            state = INSIDE_JAVASCRIPT
    elif state == EXITING_TAG:
        if char == '>':
            state = NORMAL
print(output)

Output (blank lines omitted):

  test string
   another string
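The stack mentioned in this answer does not appear in either example. As a rough sketch of what it could look like, the following uses a plain list as a stack to check that opening and closing tags are properly nested. The function name `tags_balanced` is made up for illustration. The sketch assumes that script blocks have already been removed and that the input has no self-closing tags, no comments, and no `>` inside attribute values.

```python
def tags_balanced(text):
    """Return True if every <name ...> is closed by a matching </name>, in order."""
    stack = []   # names of the tags that are currently open
    pos = 0
    while True:
        lt = text.find("<", pos)
        if lt < 0:
            break                         # no more tags
        gt = text.find(">", lt)
        if gt < 0:
            return False                  # a tag was started but never finished
        body = text[lt + 1:gt].strip()
        pos = gt + 1
        closing = body.startswith("/")
        parts = body.lstrip("/").split()
        if not parts:
            return False                  # empty tag such as <> or </>
        name = parts[0]
        if closing:
            # a closing tag must match the most recently opened tag
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack                      # everything opened must be closed


print(tags_balanced('<sometag> test <a href="http://google.com"> x</a></sometag>'))
print(tags_balanced('<a><b></a></b>'))
```

Using `append` and `pop` on a list gives last-in, first-out behaviour without importing anything, which fits the constraint in the question.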
