Splitting a unique string - Python

I am trying to find the best way to parse a string like this:


Operating Status: NOT AUTHORIZED Out of Service Date: None

I need the output to look like this:


['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']

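Is there a simple way to do this? I am parsing hundreds of strings like this. The text itself is not predictable, but it always follows the format above.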


Other example strings:


MC/MX/FF Number(s): None  DUNS Number: -- 

Power Units: 1  Drivers: 1 

Expected output:


['MC/MX/FF Number(s): None', 'DUNS Number: --']

['Power Units: 1', 'Drivers: 1']


温温酱
2 answers

眼眸繁星

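There are two ways. Both are quite hacky and are sensitive to even small variations in the input string. However, you can modify the code to make it more flexible.

Both options depend on the lines meeting these characteristics. Each group of interest must:

- start with a letter or a forward slash, probably capitalized;
- have its title followed by a colon (":");
- have only the first word after the colon captured.

Method one, regex. This one can only grab two chunks of data. The second group is "everything else", because I could not get the search pattern to repeat properly :P

Code:

    import re

    strings = ['MC/MX/FF Number(s): None DUNS Number: -- ',
               'Power Units: 1 Drivers: 1 ']

    pattern = ''.join([
        r"(",           # Start capturing group
        r"\s*[A-Z/]",   # Optional whitespace, then the first capital letter or forward slash
        r".+?:",        # Any characters (non-greedy) up to and including the colon
        r"\s*",         # Zero or more whitespace characters
        r"\w+\s*",      # One or more word characters, i.e. [a-zA-Z0-9_], then optional whitespace
        r")",           # End capturing group
        r"(.*)",        # Second group: everything else
    ])

    for s in strings:
        m = re.search(pattern, s)
        print("----------------")
        if m:
            # The pattern defines only two groups
            print(m.group(1))
            print(m.group(2))

Output:

    ----------------
    MC/MX/FF Number(s): None
    DUNS Number: --
    ----------------
    Power Units: 1
    Drivers: 1

Method two, parsing the string word by word. This method has the same basic characteristics as the regex, but it can handle more than two chunks of interest. How it works:

1. Walk through each string word by word, appending each word to newstring.
2. When a word contains a colon, set a flag.
3. On the next iteration, add the first word after the colon to newstring. If needed, you can change this to 1-2, 1-3, or 1-n words. You could also keep adding words after colonflag is set until some condition is met, such as a word with a capital letter, although that might break on words like "None". You could go on until you hit an all-caps word, but a header that is not all caps would break that.
4. Append newstring to newlist, reset the flag, and continue parsing words.

Code:

    strings = ['MC/MX/FF Number(s): None DUNS Number: -- ',
               'Power Units: 1 Drivers: 1 ']

    for s in strings:
        newlist = []
        newstring = ""
        colonflag = False
        for w in s.split():
            newstring += " " + w
            if colonflag:
                # This word is the value that follows a header; close the chunk
                newlist.append(newstring)
                newstring = ""
                colonflag = False
            if ":" in w:
                # The next word is the value for this header
                colonflag = True
        print(newlist)

Output:

    [' MC/MX/FF Number(s): None', ' DUNS Number: --']
    [' Power Units: 1', ' Drivers: 1']

Third option: create a list of all expected headers, for example

    header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):", "DUNS Number:", "Power Units:", "Drivers:"]

and split/parse based on those headers.

Fourth option: use natural language processing and machine learning to actually figure out where the logical sentences are ;)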
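The third option (splitting on a known list of headers) could be sketched roughly as follows. This is only an illustration, not code from the answer: the `HEADERS` list and the `split_fields` helper are hypothetical names, and the sketch assumes every header that can occur in the data is listed in advance.

```python
import re

# Hypothetical list of every header expected in the data, taken from the
# examples in this thread; it would need extending for real input.
HEADERS = [
    "Operating Status:",
    "Out of Service Date:",
    "MC/MX/FF Number(s):",
    "DUNS Number:",
    "Power Units:",
    "Drivers:",
]

# Try longer headers first so a shorter header can never shadow a longer one.
HEADER_RE = re.compile(
    "|".join(re.escape(h) for h in sorted(HEADERS, key=len, reverse=True))
)


def split_fields(text):
    """Return one 'Header: value' chunk per known header found in text."""
    matches = list(HEADER_RE.finditer(text))
    fields = []
    for i, m in enumerate(matches):
        # A chunk runs from this header to the start of the next one
        # (or to the end of the string for the last header).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields.append(text[m.start():end].strip())
    return fields


print(split_fields("Operating Status: NOT AUTHORIZED Out of Service Date: None"))
# ['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']
print(split_fields("MC/MX/FF Number(s): None  DUNS Number: -- "))
# ['MC/MX/FF Number(s): None', 'DUNS Number: --']
```

Unlike the two methods above, this keeps multi-word values such as "NOT AUTHORIZED" intact, at the cost of having to maintain the header list by hand.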

牛魔王的故事

Have a look at pyparsing. It seems to be the most "natural" way to express combinations of words, detect the relationships between them (in a grammatical way), and produce a structured response. There are plenty of tutorials and documents online:

- Using the pyparsing module
- Getting Started with Pyparsing
- Pyparseltongue: Parsing Text with Pyparsing

You can install pyparsing with `pip install pyparsing`.

Parsing:

    Operating Status: NOT AUTHORIZED Out of Service Date: None

needs something like:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    #
    #  test_pyparsing2.py
    #
    #  Copyright 2019 John Coppens <john@jcoppens.com>
    #
    #  This program is free software; you can redistribute it and/or modify
    #  it under the terms of the GNU General Public License as published by
    #  the Free Software Foundation; either version 2 of the License, or
    #  (at your option) any later version.
    #
    #  This program is distributed in the hope that it will be useful,
    #  but WITHOUT ANY WARRANTY; without even the implied warranty of
    #  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #  GNU General Public License for more details.
    #
    #  You should have received a copy of the GNU General Public License
    #  along with this program; if not, write to the Free Software
    #  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
    #  MA 02110-1301, USA.
    #
    #

    import pyparsing as pp

    def create_parser():
        opstatus = pp.Keyword("Operating Status:")
        auth     = pp.Combine(pp.Optional(pp.Keyword("NOT"))) + pp.Keyword("AUTHORIZED")
        status   = pp.Keyword("Out of Service Date:")
        date     = pp.Keyword("None")
        part1    = pp.Group(opstatus + auth)
        part2    = pp.Group(status + date)
        return part1 + part2

    def main(args):
        parser = create_parser()
        msg = "Operating Status: NOT AUTHORIZED Out of Service Date: None"
        print(parser.parseString(msg))
        msg = "Operating Status: AUTHORIZED Out of Service Date: None"
        print(parser.parseString(msg))
        return 0

    if __name__ == '__main__':
        import sys
        sys.exit(main(sys.argv))

Running the program:

    [['Operating Status:', 'NOT', 'AUTHORIZED'], ['Out of Service Date:', 'None']]
    [['Operating Status:', '', 'AUTHORIZED'], ['Out of Service Date:', 'None']]

With Combine and Group you can change how the output is organized.