How can I use Python to extract the Chinese terms from a file like the one below and turn them into a JSON file?

-------------------------------------------------------------------------------

File:D:\svn\aCenter\windows\dap\store\vdidc\web\vue-ui\src\components\datetime_range.vue

content:                'default': '至'

Line: 24

Time: 2018-03-26 08:46:13


-------------------------------------------------------------------------------

File:D:\svn\aCenter\windows\dap\store\vdidc\web\vue-ui\src\components\piece.vue

content:                <div><span class="branch-num">{{checkBranchNum}}</span><lang>个</lang><

Line: 6

Time: 2018-03-26 08:46:13


-------------------------------------------------------------------------------

File:D:\svn\aCenter\windows\dap\store\vdidc\web\vue-ui\src\components\piece.vue

content:                <div class="branch"><lang>分支</lang></div>

Line: 7

Time: 2018-03-26 0

........

For example, the terms "至", "个", and "分支" in the text above should become JSON like this:

{
    "至": "至",
    "个": "个",
    "分支": "分支"
}


If anyone has a clever snippet for this, please share it.


收到一只叮咚
2 answers

绝地无双

import re

# Raw string, so the backslashes in the Windows paths are not treated as escapes.
s = r'''File:D:\svn\aCenter\windows\dap\store\vdidc\web\vue-ui\src\components\datetime_range.vue
content:                'default': '至'
Line: 24
Time: 2018-03-26 08:46:13
-------------------------------------------------------------------------------
File:D:\svn\aCenter\windows\dap\store\vdidc\web\vue-ui\src\components\piece.vue
content:                <div><span class="branch-num">{{checkBranchNum}}</span><lang>个</lang><
Line: 6
Time: 2018-03-26 08:46:13
-------------------------------------------------------------------------------
File:D:\svn\aCenter\windows\dap\store\vdidc\web\vue-ui\src\components\piece.vue
content:                <div class="branch"><lang>分支</lang></div>
Line: 7
Time: 2018-03-26 0'''

# Split on every character outside the range U+4E00..U+9FA5; what is left are the Chinese terms.
p2 = re.compile(r'[^\u4e00-\u9fa5]')
result = {i: i for i in " ".join(p2.split(s)).strip().split()}
# {'至': '至', '个': '个', '分支': '分支'}  (insertion order on Python 3.7+)

A tidier way is to read from a local file. Say your file is 1.txt:

import re

p2 = re.compile(r'[^\u4e00-\u9fa5]')
# encoding='utf-8' is an assumption; change it to match how the file was saved.
with open('1.txt', 'r', encoding='utf-8') as r:
    result = {i: i for i in ' '.join(p2.split(r.read())).strip().split()}
print(result)  # {'至': '至', '个': '个', '分支': '分支'}
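The answer above builds the dictionary but stops short of writing the JSON file the question asks for. Below is a minimal sketch of that last step, not the answerer's own code. The names extract_terms, write_terms_json, and sample are made up for illustration, and UTF-8 output is an assumption.

```python
import json
import re

# Same character range as the answer above; "+" splits on whole runs at once.
NON_CHINESE = re.compile(r'[^\u4e00-\u9fa5]+')


def extract_terms(text):
    """Return {term: term} for every run of Chinese characters in text."""
    return {t: t for t in NON_CHINESE.split(text) if t}


def write_terms_json(text, out_path):
    """Write the term mapping to out_path as UTF-8 JSON (hypothetical helper)."""
    terms = extract_terms(text)
    with open(out_path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the characters readable instead of \uXXXX escapes.
        json.dump(terms, f, ensure_ascii=False, indent=4)
    return terms


# A shortened stand-in for the log in the question.
sample = "content: 'default': '至'\ncontent: <lang>个</lang>\ncontent: <lang>分支</lang>"
print(json.dumps(extract_terms(sample), ensure_ascii=False))
# {"至": "至", "个": "个", "分支": "分支"}
```

Calling write_terms_json(open('1.txt', encoding='utf-8').read(), 'lang.json') would then produce the file. Both file names are placeholders.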

紫衣仙女

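Use a regular expression that matches characters whose code points fall within the Chinese range. The key step is the extraction. Go seems fairly convenient for this, because its regular expressions have a built-in class for Chinese characters (see "Handling Chinese in Go").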
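Python's built-in re module has no script class such as \p{Han}, so the same idea can be sketched with an explicit code point range. In the sketch below, the range U+4E00..U+9FFF (the basic CJK Unified Ideographs block) and the decision to scan only the "content:" lines are assumptions. chinese_terms and log are hypothetical names.

```python
import re

# Basic CJK Unified Ideographs block; assumed to be wide enough for this log.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fff]+')


def chinese_terms(text):
    """Map each distinct run of Chinese characters to itself, in first-seen order."""
    seen = {}
    for line in text.splitlines():
        # Only scan "content:" lines, so file paths and timestamps are skipped.
        if line.startswith('content:'):
            for term in CHINESE_RUN.findall(line):
                seen.setdefault(term, term)
    return seen


# A shortened stand-in for the log in the question.
log = (
    "File:D:\\svn\\piece.vue\n"
    "content:    <lang>个</lang>\n"
    "Line: 6\n"
    "content:    <div class=\"branch\"><lang>分支</lang></div>\n"
    "content:    <lang>个</lang>\n"
)
print(chinese_terms(log))
```

Matching the terms directly with findall avoids the split-and-rejoin step, and restricting the scan to "content:" lines keeps any Chinese characters in a path out of the result.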