Using regular expressions to match names, dialogue, and actions in a transcript

Given a dialogue string like the one below, I need to find the sentences that belong to each speaker.


text = 'CHRIS: Hello, how are you...

PETER: Great, you? PAM: He is resting.

[PAM SHOWS THE COUCH]

[PETER IS NODDING HIS HEAD]

CHRIS: Are you ok?'

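For the dialogue above, I would like to return tuples with three elements:

  1. The name of the person

  2. The spoken sentence (the mixed-case text after the name), and

  3. The sentences inside square brackets

Something like this: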

('CHRIS', 'Hello, how are you...', None)


('PETER', 'Great, you?', None)


('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')


('CHRIS', 'Are you ok?', None)


etc...

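I am trying to use a regular expression to achieve this. So far I have been able to get the speakers' names with the code below, but I am struggling to identify the sentences between two speakers.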


actors = re.findall(r'\w+(?=\s*:[^/])',text)


有只小跳蛙

蛊毒传说

Regular expressions are one way to solve this problem, but you can also think of it as iterating over each token in the text and applying some logic to form groups.

For example, we can first find the groups of names and text:

    from itertools import groupby

    def isName(word):
        # Names end with ':'
        return word.endswith(":")

    text_split = [
        " ".join(list(g)).rstrip(":")
        for i, g in groupby(text.replace("]", "] ").split(), isName)
    ]
    print(text_split)
    #['CHRIS',
    # 'Hello, how are you...',
    # 'PETER',
    # 'Great, you?',
    # 'PAM',
    # 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
    # 'CHRIS',
    # 'Are you ok?']

Next, you can collect pairs of consecutive elements of text_split into tuples:

    print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
    #[('CHRIS', 'Hello, how are you...'),
    # ('PETER', 'Great, you?'),
    # ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
    # ('CHRIS', 'Are you ok?')]

We are almost at the desired output. We only need to handle the text in square brackets, and you can write a simple function for that. (Admittedly, a regular expression is an option here, but I am deliberately avoiding it in this answer.) Here is a quick approach I came up with:

    def isClosingBracket(word):
        return word.endswith("]")

    def processWords(words):
        if "[" not in words:
            return [words, None]
        else:
            return [
                " ".join(g).replace("]", ".")
                for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
            ]

    print(
        [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
    )
    #[('CHRIS', 'Hello, how are you...', None),
    # ('PETER', 'Great, you?', None),
    # ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
    # ('CHRIS', 'Are you ok?', None)]

Note that using * to unpack the result of processWords into the tuple is strictly a Python 3 feature (3.5 and later).
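The token-based steps above can also be gathered into a single function. This is only a sketch under a few assumptions: the name `parse_dialogue` is hypothetical, the text is assumed to start with a speaker name, no ordinary word is assumed to end with a colon, and the sample `text` is written out with explicit newlines so the block runs on its own. It pairs names with speech using slicing instead of index arithmetic and leaves off the trailing period on the bracketed part.

```python
from itertools import groupby

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you? PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

def parse_dialogue(text):
    """Return (name, sentence, actions-or-None) tuples (hypothetical helper)."""
    # Pad closing brackets so "]" never sticks to the following token.
    tokens = text.replace("]", "] ").split()
    # Alternate between runs of name tokens ("CHRIS:") and runs of speech tokens.
    chunks = [
        " ".join(group).rstrip(":")
        for _, group in groupby(tokens, lambda word: word.endswith(":"))
    ]
    rows = []
    # Even positions hold names, odd positions hold what was said.
    for name, speech in zip(chunks[0::2], chunks[1::2]):
        sentence, bracket, rest = speech.partition("[")
        actions = (
            [part.strip().rstrip("]").strip() for part in rest.split("[")]
            if bracket
            else []
        )
        rows.append((name, sentence.strip(), ". ".join(actions) if actions else None))
    return rows

for row in parse_dialogue(text):
    print(row)
```

Keeping the bracket handling inside the same loop avoids a second pass over text_split, at the cost of a slightly longer function.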

守候你守候我

You can do this with re.findall:

    >>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
    [('CHRIS', ' Hello, how are you...', ''),
     ('PETER', ' Great, you? ', ''),
     ('PAM',
      ' He is resting.',
      '[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
     ('CHRIS', ' Are you ok?', '')]

You will have to work out how to remove the square brackets yourself; that cannot be done by the regular expression while it is still trying to match everything else.

Regex breakdown

    \b              # Word boundary
    (\S+)           # First capture group, string of characters not having a space
    :               # Colon
    (               # Second capture group
        [^          # Match anything that is not...
            :       #     a colon
            \[\]    #     or square brackets
        ]+?         # Non-greedy match
    )
    \n?             # Optional newline
    (               # Third capture group
        \[          # Literal opening bracket
        [^:]+?      # Similar to above - exclude colon from match
        \]          # Literal closing bracket
        \n?         # Optional newline
    )?              # Third capture group is optional
    (?=             # Lookahead for...
        \b          #     a word boundary, followed by
        \S+         #     one or more non-space chars, and
        :           #     a colon
        |           # Or,
        $           # end of the string (no re.MULTILINE flag is used)
    )
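The bracket clean-up left open above can be handled in a small post-processing pass over the findall result. The following is a minimal sketch, not part of the original answer: the helper name `clean_matches` is hypothetical, and the sample `text` is written out with explicit newlines so the block runs on its own.

```python
import re

text = (
    "CHRIS: Hello, how are you...\n"
    "PETER: Great, you? PAM: He is resting.\n"
    "[PAM SHOWS THE COUCH]\n"
    "[PETER IS NODDING HIS HEAD]\n"
    "CHRIS: Are you ok?"
)

# Same pattern as in the answer above.
PATTERN = r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)'

def clean_matches(text):
    """Turn the raw findall groups into (name, sentence, actions-or-None)."""
    results = []
    for name, sentence, actions in re.findall(PATTERN, text):
        # Extract the text inside each [...] block of the third group.
        parts = re.findall(r'\[([^\]]*)\]', actions)
        # An empty third group yields no parts, which becomes None.
        results.append((name, sentence.strip(), '. '.join(parts) or None))
    return results

for row in clean_matches(text):
    print(row)
```

Splitting the work this way keeps the main pattern unchanged and confines the bracket handling to a second, much simpler expression.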