How can I randomize the numbers in incoming strings whose format is unknown?

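For an NLP project, I need to generate random number strings for training, based on training examples. The numbers arrive as strings (from OCR). Let me restrict the problem statement here to percentage values, where the formats observed so far include the following, or any meaningful combination of the format features noted: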


'60'       # no percentage sign, precision 0, no other characters

'60.00'    # no percentage sign, precision 2, dot for digit separation

'60,000'   # no percentage sign, precision 3, comma for digit separation

'60.0000'  # no percentage sign, precision 4, dot for digit separation

'60.00%'   # same as above, with percentage sign

'60.00 %'  # same as above, with whitespace

'100%'     # three digits, zero precision, percentage sign

'5'        # single digit

'% 60'     # percentage sign in front of the number, whitespace

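My goal is to randomize the number while preserving the format character by character (exception: the number of integer digits may change, so 5.6 may be randomized to 18.7 or 100.0, and vice versa). Percentage values should lie between 0 and 100. A few examples of what I need: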


input  = '5'  # integer-like digit

output = [  '7', 

           '18', 

          '100'] 


input  =  '100.00 %' # 2-precision float with whitespace & percentage sign

output = [  '5.38 %', 

           '38.05 %', 

          '100.00 %']  


input  =  '% 60,000' # percentage sign, whitespace, 4-precision float, comma separator

output = ['% 5,5348', 

          '% 48,7849', 

          '% 100,0000'] 

How can I do this? The solution can be conceptual or a code example. It needs to reflect the formats that can occur in real data.


So far, the best approach I know of is to hand-write an if clause for every format variant I can think of.


慕侠2389804
130 views, 2 answers

2 Answers

胡子哥哥

The following seems to work for the example inputs you provided. We are only interested in finding the leading integer digits and an optional separator followed by more digits. We don't actually need to look for whitespace or percent signs, because we only replace the digits within any given match. Let me know if I missed anything:

    import re
    from random import uniform

    # Leading integer digits, optionally followed by a separator and more digits.
    pattern = r"\d{1,3}((?P<separator>[,.])(?P<floating>\d+))?"

    strings = (
        "60",
        "60.00",
        "60,000",
        "60.0000",
        "60.00%",
        "60.00 %",
        "100%",
        "5",
        "% 60",
        "% 60,000",
    )

    def randomize(match):
        value = uniform(0, 100)
        separator = match.group("separator")
        if separator is None:
            # Integer-like input: return a whole number between 0 and 100.
            return str(round(value))
        # Keep the original number of fractional digits and the original separator.
        precision = len(match.group("floating"))
        # Formatting the whole value at once lets rounding carry into the
        # integer part (for example 99.996 becomes "100.00" at precision 2).
        return f"{value:.{precision}f}".replace(".", separator)

    for string in strings:
        print(re.sub(pattern, randomize, string))

Sample output (the values are random and differ on every run):

    29
    95.08
    51,507
    9.1783
    0.80%
    6.56 %
    16%
    22
    % 27
    % 93,174
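One way to sanity-check a randomizer like this is to compare the "shape" of the input and the output: everything except the digits should be identical, and the number of fractional digits should not change. Below is a minimal sketch of such a check. The helper names `shape` and `same_format` are hypothetical and not part of the answer, and the sketch assumes that only `.` and `,` occur as separators.

```python
import re

# Hypothetical helpers for illustration only; assumes "." and "," are the
# only separators that occur in the data.
_NUMBER = re.compile(r"\d+(?:(?P<sep>[,.])(?P<frac>\d+))?")

def shape(text):
    """Mask the integer part as 'N' and every fractional digit as 'd'."""
    def mask(match):
        if match.group("sep") is None:
            return "N"
        return "N" + match.group("sep") + "d" * len(match.group("frac"))
    return _NUMBER.sub(mask, text)

def same_format(original, randomized):
    """True if both strings differ only in their digits, not in their layout."""
    return shape(original) == shape(randomized)
```

For instance, `same_format("100.00 %", "5.38 %")` is true because both strings reduce to `N.dd %`, while `same_format("60.00%", "7.5%")` is false because the precision differs. The integer part is deliberately masked as a single `N`, since the question allows its digit count to change.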

阿波罗的战车

You can call the following function to generate random numbers for your case. You can modify it further to best suit your situation.

    import numpy as np

    def random_gen():
        # Random number of fractional digits (0 to 5) and a random value in [0, 100).
        precision = int(np.random.randint(0, 6))
        val = np.random.uniform(0, 100)
        # Fixed-point formatting keeps trailing zeros, so the precision is honored.
        val = f"{val:.{precision}f}"

        # Insert 0 to 2 spaces at a random position in the string.
        white_space = np.random.randint(0, 3)
        rand_index = np.random.randint(0, len(val))
        val = val[0:rand_index] + " " * white_space + val[rand_index:]

        # With probability 1/2, add a percent sign either after or before the value.
        if np.random.randint(0, 2) > 0:
            if np.random.randint(0, 2) > 0:
                val = val + "%"
            else:
                val = "%" + val
        return val

    print(random_gen())
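Note that the function above inserts the whitespace at a random index, so it can end up between the digits. The following is a sketch of a variant that samples each format feature explicitly and keeps the whitespace between the number and the percent sign, as in the formats listed in the question. The name `random_percentage` and the chosen ranges (up to 4 fractional digits, at most one space) are assumptions for illustration, and it uses the standard-library `random` module instead of NumPy.

```python
import random

def random_percentage(rng=random):
    """Build one random percentage string from explicitly sampled features.

    Hypothetical sketch: the ranges below are assumptions, not taken from
    the answer above.
    """
    precision = rng.randint(0, 4)        # number of fractional digits (0 to 4)
    separator = rng.choice([".", ","])   # decimal separator
    value = rng.uniform(0, 100)
    number = f"{value:.{precision}f}".replace(".", separator)

    sign_position = rng.choice(["none", "before", "after"])
    gap = " " * rng.randint(0, 1)        # optional single space next to the sign
    if sign_position == "before":
        return "%" + gap + number
    if sign_position == "after":
        return number + gap + "%"
    return number
```

Passing a seeded generator, for example `random_percentage(random.Random(42))`, makes the output reproducible, which can help when regenerating a training set.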
