
Deleting a block of sentences whose start and end are clearly defined

I am using Python 3.6.8.


I have a text file, for example:


###
books 22 feb 2017 21 april 2018
books 22 feb 2017 21
22 feb 2017 21 april
feb 2017 21 april 2018
$$$
###
risk true stories people never thought they d dare share
risk true stories people never
true stories people never thought
stories people never thought they
people never thought they d
never thought they d dare
thought they d dare share
$$$
###
everyone hanging out without me mindy kaling non fiction
everyone hanging out without me
hanging out without me mindy
out without me mindy kaling
without me mindy kaling non
me mindy kaling non fiction
$$$

We generate it with:


from nltk.util import ngrams  # assumption: ngrams is NLTK's helper; the question does not show the import

booksWithNGrams = []
for line in books:  # books is the list of original sentences
    tokens = line.split(" ")
    output = list(ngrams(tokens, 5))
    booksWithNGrams.append("###")  # start of block
    booksWithNGrams.append(line)  # original line
    for x in output:  # n-grams
        booksWithNGrams.append(' '.join(x))
    booksWithNGrams.append("$$$")  # end of block

As you can see, each sentence together with its n-grams starts with ### and ends with $$$, so the start and end of every block are clearly defined.
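For anyone who wants to reproduce the file without extra dependencies, here is a minimal sketch of the block layout. It assumes `ngrams` behaves like a plain sliding window over the tokens. The local `ngrams` and the `build_blocks` helper are hypothetical stand-ins written for illustration, not part of the original code.

```python
def ngrams(tokens, n):
    """Yield every run of n consecutive tokens (sliding-window stand-in)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def build_blocks(books, n=5):
    """Return the lines of the file: one ###...$$$ block per sentence."""
    lines = []
    for line in books:
        lines.append("###")                      # start of block
        lines.append(line)                       # original sentence
        for gram in ngrams(line.split(" "), n):  # its n-grams
            lines.append(" ".join(gram))
        lines.append("$$$")                      # end of block
    return lines


books = ["books 22 feb 2017 21 april 2018"]
text = "\n".join(build_blocks(books))
print(text)
```

Joining the lines with a single newline gives one string in which every block is a contiguous run of characters.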


Given a sentence, I want to delete the block that contains it. For example, if I enter 22 feb 2017 21 april, I want to delete:


###
books 22 feb 2017 21 april 2018
books 22 feb 2017 21
22 feb 2017 21 april
feb 2017 21 april 2018
$$$

How can I do this?


千巷猫影
1 Answer

catspeake

As you said, each block is delimited by ### and $$$. We can treat the text as a sequence of character offsets between those markers and use re.finditer to locate the block boundaries.

import re

# text holds the whole file as a single string
# offset where each "###" marker starts
starts = [s.start() for s in re.finditer('###', text)]
# [0, 105, 349]

# offset where each "$$$" marker ends ("$" is special in a regex, so escape it)
ends = [e.end() for e in re.finditer(re.escape('$$$'), text)]
# [104, 348, 558]

# pair every start offset with its end offset
blocks = sorted(starts + ends)
nBlocks = [blocks[i:i+2] for i in range(0, len(blocks), 2)]
# [[0, 104], [105, 348], [349, 558]]

# find where the input sentence first occurs
find = '22 feb 2017 21 april'
where = text.index(find)
# 10

# remove the block that contains that offset
cleanText = text
for start, end in nBlocks:
    if start <= where < end:
        # end + 1 also drops the newline that follows the block
        cleanText = text[:start] + text[end + 1:]
        break

print(cleanText)

This prints:

###
risk true stories people never thought they d dare share
risk true stories people never
true stories people never thought
stories people never thought they
people never thought they d
never thought they d dare
thought they d dare share
$$$
###
everyone hanging out without me mindy kaling non fiction
everyone hanging out without me
hanging out without me mindy
out without me mindy kaling
without me mindy kaling non
me mindy kaling non fiction
$$$
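If character offsets feel fragile, the same idea can be sketched line by line. This is only one possible variant and not the approach above: it assumes the markers sit on lines of their own and that the input must equal a whole line of the block, rather than match as a substring. The `remove_blocks` name and its parameters are made up for this example.

```python
def remove_blocks(text, sentence, start="###", end="$$$"):
    """Return text without any block that has `sentence` as a full line."""
    kept = []       # lines that survive
    block = []      # lines of the block currently being read
    inside = False
    for line in text.splitlines():
        if line.strip() == start:
            kept.extend(block)          # flush an unterminated block, if any
            inside, block = True, [line]
        elif inside:
            block.append(line)
            if line.strip() == end:
                # keep the block only if no line equals the sentence
                if sentence not in (item.strip() for item in block):
                    kept.extend(block)
                inside, block = False, []
        else:
            kept.append(line)           # text outside any block is untouched
    kept.extend(block)                  # a trailing unterminated block is kept
    return "\n".join(kept)


# Abbreviated sample: the first block in full, the third block shortened.
sample = "\n".join([
    "###",
    "books 22 feb 2017 21 april 2018",
    "books 22 feb 2017 21",
    "22 feb 2017 21 april",
    "feb 2017 21 april 2018",
    "$$$",
    "###",
    "everyone hanging out without me mindy kaling non fiction",
    "everyone hanging out without me",
    "$$$",
])
result = remove_blocks(sample, "22 feb 2017 21 april")
print(result)
```

Unlike text.index, which stops at the first substring hit, this removes every block that contains the sentence as a line. Which behaviour is right depends on whether the same n-gram can appear in more than one block.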