How to use a regular expression to parse substrings from text when some of the parts are missing

I want to parse substrings that follow a specific format out of a string.


The survey responses have the format


<survey_name>_<category_name>_<question_type>.<response_type>

For example, the input strings:


y_survey_category1_1st.no

x_survey_category2_2nd

survey_z_category_3_3rd.yes_more_7

x_survey_category_4_4th.excluded

survey_z_category5.yes_more_7

survey_z_category_6.yes_more_7

The pattern further down is what I have so far. It works for most cases, but fails when the optional question_type is absent (e.g. inputs 5 and 6 above).


Here are the constraints on each part:


 1. survey_name can only be one of the 3 values

 2. category_name will always be present and can have underscores

 3. question_type may be present and may have underscore in it

 4. response_type may be present and may have underscore in it

 5. Either question_type or response_type or both will always be present


(x_survey|y_survey|survey_z)_([\w_]+)_(1st|2nd|3rd|4th)[.]?(.*)

https://regex101.com/r/bGc0gM/1


Any help on how to modify the regex so that it works for all cases?
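
To make the failure concrete, the pattern from the question can be run against the six sample inputs. This is only a minimal sketch, assuming Python's re module is the target engine (the question itself only links to regex101); the names original, inputs, try_original and results are made up for the illustration.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(
    r"(x_survey|y_survey|survey_z)_([\w_]+)_(1st|2nd|3rd|4th)[.]?(.*)"
)

# The six sample inputs from the question, in the same order.
inputs = [
    "y_survey_category1_1st.no",
    "x_survey_category2_2nd",
    "survey_z_category_3_3rd.yes_more_7",
    "x_survey_category_4_4th.excluded",
    "survey_z_category5.yes_more_7",
    "survey_z_category_6.yes_more_7",
]


def try_original(line):
    """Return the captured groups, or None when the pattern does not match."""
    m = original.match(line)
    return m.groups() if m else None


results = {line: try_original(line) for line in inputs}

for line, groups in results.items():
    print(line, "->", groups)
```

Inputs 5 and 6 come back as None because `_(1st|2nd|3rd|4th)` is mandatory in this pattern. Note also that a missing response_type is captured as an empty string here rather than as None.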


萧十郎
2 Answers

GCT1015

It is hard to find a regex that copes with question_type when it is optional, so I forced the category to consist of letters/underscores and to end with a digit.

    category : [a-z_]+\d
    all      : (x_survey|y_survey|survey_z)_([a-z_]+\d)(?>_(1st|2nd|3rd|4th))?(?>\.(.*))?
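
For anyone trying this in Python: the `(?>...)` atomic groups used above are only accepted by the re module from Python 3.11 on. The sketch below therefore swaps them for plain non-capturing groups `(?:...)`, which is my substitution and not what the answer wrote; the names PATTERN and parse are also just for illustration. It assumes each line must match as a whole, hence fullmatch.

```python
import re

# Same structure as the answer's pattern, but with (?:...) instead of the
# atomic groups (?>...), so it also compiles on Python versions before 3.11.
PATTERN = re.compile(
    r"(x_survey|y_survey|survey_z)"   # survey_name
    r"_([a-z_]+\d)"                   # category_name: letters/underscores, then one digit
    r"(?:_(1st|2nd|3rd|4th))?"        # optional question_type
    r"(?:\.(.*))?"                    # optional response_type
)


def parse(line):
    """Return (survey, category, question_type, response_type) or None."""
    m = PATTERN.fullmatch(line)
    return m.groups() if m else None


print(parse("survey_z_category5.yes_more_7"))
print(parse("x_survey_category2_2nd"))
```

The price of this approach is the assumption about the category: it has to stop at its first digit, so a category such as category_22_22 will not match.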

慕森卡

Short version: a snippet showing a regular expression that works. The pattern contains extra whitespace, so you need to set the re.X flag. re.I is set as well, to ignore case.

    import re

    # Capture:
    # <survey_name>_<category_name>_<question_type>.<response_type>
    #     (0)           (1)            (2) or (4)      (3) or (6)
    pat = r"""^(x_survey|y_survey|survey_z)    # <sn>  (0)
               _                                # _
               ([^.]+)                          # <cn> (1)
               (?:                              # One of
                   _(1st|2nd|3rd|4th)  [.]([\w]+)$ |  # qt (2) & rt (3)
                   _(1st|2nd|3rd|4th)            $ |  # qt (4)
                   (?<!_(1st|2nd|3rd|4th)) [.]([\w]+)$    # rt (6)
               )
           """
    matcher = re.compile(pat, re.I | re.X)

Longer version, with two variants of the solution and test cases:

    r"""
    Format: <survey_name>_<category_name>_<question_type>.<response_type>

     1. survey_name can only be one of the 3 values
     2. category_name will always be present and can have underscores
     3. question_type may be present and may have underscore in it
     4. response_type may be present and may have underscore in it
     5. Either question_type or response_type or both will always be present

    A) <survey_name> always there
        easy to find, one of three: (x_survey|y_survey|survey_z)
    B) <category_name> always there
        has 0 or more internal underscores
    C) <question_type> optional
        one or more internal underscores
        ends before a dot or at end of line
        one of 4 values: (1st|2nd|3rd|4th)
    D) <response_type> optional
        starts after a .
        ends at end of line

    Both category_name and question_type can have zero or more internal
    underscores.  This results in an ambiguity, since we have no way of knowing
    when category_name ends and question_type starts.

    Assume that question_type is one of the 4 values (1st|2nd|3rd|4th).  This
    results in 3 valid cases and one that should not match:

    Format: <survey_name>_<category_name>_<question_type>.<response_type>

    0) Both question_type and response_type present
       (x_survey|y_survey|survey_z)_<category_name>_(1st|2nd|3rd|4th).<response_type>
       -->
       p1 = r"^(x_survey|y_survey|survey_z)_([^.]+)_(1st|2nd|3rd|4th)[.]([\w]+)$"  # noqa:

    1) Only question_type and no response_type present
       (x_survey|y_survey|survey_z)_<category_name>_(1st|2nd|3rd|4th)
       -->
       p2 = r"^(x_survey|y_survey|survey_z)_([^.]+)_(1st|2nd|3rd|4th)$"

    2) No question_type and only response_type present
       (x_survey|y_survey|survey_z)_<category_name>.<response_type>
       -->
       p3 = r"^(x_survey|y_survey|survey_z)_([^.]+)(?<!_(1st|2nd|3rd|4th))[.]([\w]+)$"  # noqa:

    3) Neither question_type nor response_type present
       (x_survey|y_survey|survey_z)_<category_name>
       Neither of p1, p2 nor p3 will match.

    Since the patterns are mutually exclusive we can try them one after the
    other, or we can combine the three patterns into one large pattern.
    """
    from collections import namedtuple
    import re

    Response = namedtuple('Response', ['survey_name',
                                       'category_name',
                                       'question_type',
                                       'response_type'])

    cases = ["survey_z__CATEGORY",
             "y_survey_category1_1st.no",
             "x_survey_category2_2nd",
             "survey_z_category_3_3rd.yes_more_7",
             "x_survey_category_4_4th.excluded",
             "X_SURVEY_CATEGORY_4_4TH.excluded",
             "survey_z_category5.yes_more_7",
             "survey_z_category_6.yes_more_7",
             "survey_z_category_7._yes_more_77",
             "survey_z_category_8_._yes_more_88",
             "survey_z_category_8888__foo._yes_more_77_",
             "survey_z_category_22_22_1st_2nd._yes_more_77_",
             "survey_z__CATEGORY_2222_1ST__2ND._yes_odd_77_",
             "survey_z__CATEGORY_3333_1ST__2ND",
             "survey_z__CATEGORY_foo_1ST__2ND._yes_odd_77_",
             ]


    def parse_survey_response_1(line):
        """Parse a line with a survey response, setting optional values not
        present to None.  Return a Response or None when no match.

        Use a list of mutually exclusive patterns for line format:
        <survey_name>_<category_name>_<question_type>.<response_type>
        """
        # Format <sn>_<cn>_(1st|2nd|3rd|4th).<rt>
        # Format <sn>_<cn>_(1st|2nd|3rd|4th)
        # Format: <sn>_<cn>.<rt>
        prfx = r"^(x_survey|y_survey|survey_z)_([^.]+)"
        regexs = [
            prfx + r"_(1st|2nd|3rd|4th)[.]([\w]+)$",       # 4 captures
            prfx + r"_(1st|2nd|3rd|4th)($)",               # 3+1 captures
            prfx + r"(?<!_(1st|2nd|3rd|4th))[.]([\w]+)$",  # 4 captures
            ]
        matchers = [re.compile(r, re.I | re.X) for r in regexs]
        for m in matchers:
            parsed_line = m.search(line)
            if parsed_line:
                # Empty or unmatched groups become None.
                map_empty2none = (g if g else None for g in parsed_line.groups())
                return Response._make(map_empty2none)
        return None


    def parse_survey_response_2(line):
        """Parse a line with a survey response, setting optional values not
        present to None.  Return a Response or None when no match.

        Use one large pattern for line format:
        <survey_name>_<category_name>_<question_type>.<response_type>
        """
        # Capture:
        # <survey_name>_<category_name>_<question_type>.<response_type>
        #     (0)           (1)            (2) or (4)      (3) or (6)
        pat = r"""^(x_survey|y_survey|survey_z)    # <sn>  (0)
                   _                                # _
                   ([^.]+)                          # <cn> (1)
                   (?:                              # One of
                       _(1st|2nd|3rd|4th)  [.]([\w]+)$ |  # qt (2) & rt (3)
                       _(1st|2nd|3rd|4th)            $ |  # qt (4)
                       (?<!_(1st|2nd|3rd|4th)) [.]([\w]+)$    # rt (6)
                   )
               """
        matcher = re.compile(pat, re.I | re.X)
        parsed_line = matcher.search(line)
        if parsed_line:
            pg = list(parsed_line.groups())
            pg[2] = pg[2] if pg[2] else pg[4]  # capture 2 or 4
            pg[3] = pg[3] if pg[3] else pg[6]  # capture 3 or 6
            return Response._make(pg[:4])
        return None


    def unparse_survey(response):
        """Rebuild the original line from a Response (used as a round-trip check)."""
        if response.response_type:
            head = '_'.join(e for e in response[:-1] if e)
            unparsed = '.'.join([head, response.response_type])
        else:
            unparsed = '_'.join(e for e in response if e)
        return unparsed


    for c in cases:
        p1 = parse_survey_response_1(c)
        p2 = parse_survey_response_2(c)
        print(c)
        print(p1)
        print(p2)
        print(20*'=')
        if p1 or p2:
            assert c == unparse_survey(p1)
            assert c == unparse_survey(p2)

Running it gives:

    run reex02.py
    survey_z__CATEGORY
    None
    None
    ====================
    y_survey_category1_1st.no
    Response(survey_name='y_survey', category_name='category1', question_type='1st', response_type='no')
    Response(survey_name='y_survey', category_name='category1', question_type='1st', response_type='no')
    ====================
    x_survey_category2_2nd
    Response(survey_name='x_survey', category_name='category2', question_type='2nd', response_type=None)
    Response(survey_name='x_survey', category_name='category2', question_type='2nd', response_type=None)
    ====================
    survey_z_category_3_3rd.yes_more_7
    Response(survey_name='survey_z', category_name='category_3', question_type='3rd', response_type='yes_more_7')
    Response(survey_name='survey_z', category_name='category_3', question_type='3rd', response_type='yes_more_7')
    ====================
    x_survey_category_4_4th.excluded
    Response(survey_name='x_survey', category_name='category_4', question_type='4th', response_type='excluded')
    Response(survey_name='x_survey', category_name='category_4', question_type='4th', response_type='excluded')
    ====================
    X_SURVEY_CATEGORY_4_4TH.excluded
    Response(survey_name='X_SURVEY', category_name='CATEGORY_4', question_type='4TH', response_type='excluded')
    Response(survey_name='X_SURVEY', category_name='CATEGORY_4', question_type='4TH', response_type='excluded')
    ====================
    survey_z_category5.yes_more_7
    Response(survey_name='survey_z', category_name='category5', question_type=None, response_type='yes_more_7')
    Response(survey_name='survey_z', category_name='category5', question_type=None, response_type='yes_more_7')
    ====================
    survey_z_category_6.yes_more_7
    Response(survey_name='survey_z', category_name='category_6', question_type=None, response_type='yes_more_7')
    Response(survey_name='survey_z', category_name='category_6', question_type=None, response_type='yes_more_7')
    ====================
    survey_z_category_7._yes_more_77
    Response(survey_name='survey_z', category_name='category_7', question_type=None, response_type='_yes_more_77')
    Response(survey_name='survey_z', category_name='category_7', question_type=None, response_type='_yes_more_77')
    ====================
    survey_z_category_8_._yes_more_88
    Response(survey_name='survey_z', category_name='category_8_', question_type=None, response_type='_yes_more_88')
    Response(survey_name='survey_z', category_name='category_8_', question_type=None, response_type='_yes_more_88')
    ====================
    survey_z_category_8888__foo._yes_more_77_
    Response(survey_name='survey_z', category_name='category_8888__foo', question_type=None, response_type='_yes_more_77_')
    Response(survey_name='survey_z', category_name='category_8888__foo', question_type=None, response_type='_yes_more_77_')
    ====================
    survey_z_category_22_22_1st_2nd._yes_more_77_
    Response(survey_name='survey_z', category_name='category_22_22_1st', question_type='2nd', response_type='_yes_more_77_')
    Response(survey_name='survey_z', category_name='category_22_22_1st', question_type='2nd', response_type='_yes_more_77_')
    ====================
    survey_z__CATEGORY_2222_1ST__2ND._yes_odd_77_
    Response(survey_name='survey_z', category_name='_CATEGORY_2222_1ST_', question_type='2ND', response_type='_yes_odd_77_')
    Response(survey_name='survey_z', category_name='_CATEGORY_2222_1ST_', question_type='2ND', response_type='_yes_odd_77_')
    ====================
    survey_z__CATEGORY_3333_1ST__2ND
    Response(survey_name='survey_z', category_name='_CATEGORY_3333_1ST_', question_type='2ND', response_type=None)
    Response(survey_name='survey_z', category_name='_CATEGORY_3333_1ST_', question_type='2ND', response_type=None)
    ====================
    survey_z__CATEGORY_foo_1ST__2ND._yes_odd_77_
    Response(survey_name='survey_z', category_name='_CATEGORY_foo_1ST_', question_type='2ND', response_type='_yes_odd_77_')
    Response(survey_name='survey_z', category_name='_CATEGORY_foo_1ST_', question_type='2ND', response_type='_yes_odd_77_')
    ====================
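
As a possible variation on the combined pattern (not something the answer proposes), the three alternatives can be folded into two optional groups if category_name is matched lazily, with rule 5 checked in code afterwards. This is only a sketch under the same assumption that question_type is one of 1st/2nd/3rd/4th; SURVEY and parse_survey_response_3 are hypothetical names, and I have not checked that it agrees with the patterns above on every unusual input.

```python
import re

# Hypothetical variant: a lazy category_name leaves a trailing "_1st" etc.
# for question_type, so no lookbehind and no duplicated groups are needed.
SURVEY = re.compile(
    r"""
    ^(?P<survey_name>x_survey|y_survey|survey_z)
    _(?P<category_name>[^.]+?)                  # as short as possible
    (?:_(?P<question_type>1st|2nd|3rd|4th))?    # optional
    (?:[.](?P<response_type>\w+))?              # optional
    $
    """,
    re.I | re.X,
)


def parse_survey_response_3(line):
    """Return a dict of the four fields, or None when the line is invalid."""
    m = SURVEY.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Rule 5: at least one of question_type / response_type must be present.
    if fields["question_type"] is None and fields["response_type"] is None:
        return None
    return fields


print(parse_survey_response_3("survey_z_category_6.yes_more_7"))
print(parse_survey_response_3("survey_z__CATEGORY"))
```

Named groups replace the "capture 2 or 4" bookkeeping; the cost is that rule 5 is enforced by an explicit check rather than by the pattern itself.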