Parsing financial statements with regular expressions

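I'm working on a regular expression that returns text matching a particular pattern as groups. This is the regular expression I'm using:

    r"([\w+ \-? \w+]* [\w+ ]+ [\(?\w+ \)?]*) (\(?[\d,-]+\)?) (\(?[\d,-]+\)?)"

Below are the sample lines I'm parsing, the output I want for each, and what I actually get: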


1) String: LOSS BEFORE INCOME TAXES (900,000) (900,000)

Desired output: [('LOSS BEFORE INCOME TAXES', '(900,000)', '(900,000)')]

Final result: correct 


2) String: INCOME TAXES (RECOVERED) (90,000) (90,000)

Desired output: [('INCOME TAXES (RECOVERED)', '(90,000)', '(90,000)')]

Final result: correct


3) String: RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999

Desired output: [('RETAINED EARNINGS - BEGINNING OF YEAR', '9,999,999', '9,999,999')]

Final result: correct


4) String: EXPENSES

Desired output: ['EXPENSES']

Final result: correct


5) String: Subcontracts 8,058 2,655

Desired output: [('Subcontracts', '8,000,000')]

Final result: ['Subcontracts 8', '', '058 2', '', '655', '']


6) String: Business taxes 116 -

Desired output: [('Business taxes', '116', '-')]

Final result: ['Business taxes 116 ', '', '']


7) String: 600,000 600,000

Desired output: [(600,000), (600,000)]

Final result: ['642', '', '437 629', '', '070', '']


8) String: Salaries, wages and benefits 400,000 400,000

Desired output: [('Salaries, wages and benefits', '400,000', '400,000')]

Final result: [(' wages and benefits', '463,437', '466,742')]

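I'm not sure what I'm doing wrong or what I'm missing, but cases 5, 6, 7 and 8 have problems. How can I adjust the pattern above so that it covers all of the cases listed? Thanks in advance!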


慕慕森
3 Answers

MYYA

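You can try this, mate:

    ^([a-z, \(\)-]*?)?\(?([\d,]+)?\)?\s*?\(?([\d,-]+)?\)?$

Explanation:

^ - anchors to the start of the string.
([a-z, \(\)-]*?)? - matches any of the characters a to z, ",", space, "(", ")" or "-", zero or more times (lazy mode).
\(? - matches "(" (the ? makes it optional).
([\d,]+)? - matches any digit or "," one or more times (the ? makes the group optional).
\)? - matches ")" (the ? makes it optional).
\s*? - matches whitespace zero or more times.
\(?([\d,-]+)?\)? - matches any digit, "," or "-", optionally wrapped in parentheses.
$ - end of the string.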
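The character class in the pattern above lists only lowercase a to z, so it likely needs a case-insensitive flag before it can match labels such as EXPENSES. Here is a minimal sketch, assuming Python's `re` module and the `re.IGNORECASE` flag (neither is stated in the answer); the helper name `parse_line` is made up for illustration. In this pattern the parentheses sit outside the capture groups, so they are not part of the captured numbers, and the description keeps its trailing space unless it is stripped.

```python
import re

# Pattern copied from the answer above. re.IGNORECASE is an assumption added
# here because the character class only lists lowercase letters.
PATTERN = re.compile(
    r"^([a-z, \(\)-]*?)?\(?([\d,]+)?\)?\s*?\(?([\d,-]+)?\)?$",
    re.IGNORECASE,
)


def parse_line(line):
    """Return (description, first, second), or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # Groups that did not take part in the match become empty strings.
    description, first, second = match.groups(default="")
    # The lazy description group still swallows the space before the numbers.
    return description.strip(), first, second


for sample in (
    "LOSS BEFORE INCOME TAXES (900,000) (900,000)",
    "Business taxes 116 -",
    "600,000 600,000",
    "EXPENSES",
):
    print(parse_line(sample))
```

If the sign information carried by the parentheses matters, the parentheses would have to be moved inside the capture groups.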

慕娘9325324

I think this regular expression will do what you want:

    ^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$

It looks for a description that starts with an uppercase letter, may contain letters, digits, spaces and some of the characters (),%;- and does not end with a (, a digit or whitespace. The description is followed by two values, each of which is either a run of digits and commas that may be enclosed in () or a single -. All of the groups are optional, so that lines with no description or no numbers still match.

In Python:

    import re

    data = """LOSS BEFORE INCOME TAXES (900,000) (900,000)
    INCOME TAXES (RECOVERED) (90,000) (90,000)
    RETAINED EARNINGS - BEGINNING OF YEAR 9,999,999 9,999,999
    EXPENSES
    Subcontracts 8,058 2,655
    Business taxes 116 -
    600,000 600,000
    GROSS PROFIT (50%; 2016 - 50%) 500,000 500,000
    Bad debts - 50
    Salaries, wages and benefits 400,000 400,000"""

    # Raw string so that \d and \s reach the regex engine unchanged.
    regex = re.compile(r'^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$', re.MULTILINE)
    print(regex.findall(data))

Output:

    [('LOSS BEFORE INCOME TAXES', '(900,000)', '(900,000)'),
     ('INCOME TAXES (RECOVERED)', '(90,000)', '(90,000)'),
     ('RETAINED EARNINGS - BEGINNING OF YEAR', '9,999,999', '9,999,999'),
     ('EXPENSES', '', ''),
     ('Subcontracts', '8,058', '2,655'),
     ('Business taxes', '116', '-'),
     ('', '600,000', '600,000'),
     ('GROSS PROFIT (50%; 2016 - 50%)', '500,000', '500,000'),
     ('Bad debts', '-', '50'),
     ('Salaries, wages and benefits', '400,000', '400,000')]
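The captured amounts are still strings. Below is a minimal sketch of turning them into integers, assuming the usual accounting convention that parentheses mark a negative amount and that a lone - stands for an empty entry. The helper names `to_amount` and `parse_statement` are hypothetical and not part of the answer.

```python
import re

# Pattern from the answer above, split over two raw strings for readability.
LINE_RE = re.compile(
    r"^([A-Z][A-Za-z0-9 (),%;-]+?[^(\d\s])? ?"
    r"(?:(\(?[\d,]+\)?|-)\s+(\(?[\d,]+\)?|-))?$",
    re.MULTILINE,
)


def to_amount(text):
    """Convert '(900,000)' to -900000, '116' to 116, and '' or '-' to None."""
    if text in ("", "-"):
        return None
    # Assumption: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    value = int(text.strip("()").replace(",", ""))
    return -value if negative else value


def parse_statement(data):
    """Return one (description, first, second) tuple per matched line."""
    return [
        (description, to_amount(first), to_amount(second))
        for description, first, second in LINE_RE.findall(data)
    ]


sample = "LOSS BEFORE INCOME TAXES (900,000) (900,000)\nEXPENSES\nBusiness taxes 116 -"
for row in parse_statement(sample):
    print(row)
```

Returning None for a missing value keeps "no figure reported" distinct from an actual zero; whether that distinction matters depends on how the statement is used afterwards.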

江户川乱折腾

Try the following regular expression:

    r"([\w ,()-]*)[\(?[\d, -]*\)?]*[\(?[\d, -]*\)?]*"
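A quick check can help when evaluating a pattern like the one above. The sketch below assumes Python's `re` module. It suggests that the first group, which allows word characters (digits included), spaces, commas, parentheses and hyphens, is greedy enough to take in the whole line, so the numbers are not split off. Inside square brackets, characters such as ? and [ are literal members of the set rather than quantifiers or nested sets, which is why the later parts of the pattern do not act as optional parenthesised numbers. The warnings filter is there only because newer Python versions emit a FutureWarning for a [ inside a set.

```python
import re
import warnings

# Pattern copied from the answer above.
RAW = r"([\w ,()-]*)[\(?[\d, -]*\)?]*[\(?[\d, -]*\)?]*"

with warnings.catch_warnings():
    # A '[' inside a character set triggers a FutureWarning on newer Pythons.
    warnings.simplefilter("ignore", FutureWarning)
    pattern = re.compile(RAW)

line = "Subcontracts 8,058 2,655"
match = pattern.match(line)

# The pattern has a single capture group, so groups() is a 1-tuple.
print(match.groups())
```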
