Asking a DBA to help me look at a MongoDB high-load problem

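Hello experts. I recently took over a new service at my company, and it is a real pit: a monitoring system backed by MongoDB that monitors the interfaces of the company's entire system and receives data reported by each business service. Overall the reporting QPS is not high, under 100, but the load on the machine is frighteningly high: mongod takes 98 to 99.5% of CPU time, and has done so for a long time!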
top - 17:33:14 up 204 days, 14:37, 2 users, load average: 89.99, 96.20, 100.96
Tasks: 389 total, 1 running, 386 sleeping, 0 stopped, 2 zombie
Cpu(s): 94.3%us, 3.3%sy, 0.0%ni, 2.1%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 65855664k total, 60083264k used, 5772400k free, 169764k buffers
Swap: 8388600k total, 0k used, 8388600k free, 23005656k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
59181 root 20 0 114g 33g 8264 S 2348.1 53.5 34480,54 mongod
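I also have no idea how my predecessors deployed this. I am not familiar with MongoDB and so far cannot find where the problem is; I can only guess that too many slow queries are causing the heavy pressure. Also, mongostat in the production version has no idxmiss% field, so it is hard to tell whether the indexes are well built. Some data follows.
1. mongostat data. It shows that there are both reads and writes. There are fewer writes than queries and the writes are not blocked, but the queries are badly blocked.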
insert query update delete getmore command faults locked db qr|qw ar|aw netIn netOut conn set repl time
572183 *0 59108|0 0 monitor_1120_minute:0.0% 0|0 194|0 984k 1m 318 replica_monitor PRI 17:17:37
261354 *0 2988|0 0 monitor_1016_minute:0.0% 0|0 192|0 453k 778k 320 replica_monitor PRI 17:17:38
621734 *0 66135|0 0 monitor_1008_minute:0.0% 0|0 195|0 535k 829k 318 replica_monitor PRI 17:17:39
481681 *0 44175|0 0 monitor_1261_minute:0.0% 0|0 192|0 4m 5m 316 replica_monitor PRI 17:17:41
327972 *0 34212|0 0 monitor_1204_minute:0.0% 0|0 185|0 354k 654k 312 replica_monitor PRI 17:17:42
254527 *0 29101|0 0 monitor_1204_minute:0.0% 0|0 185|0 182k 608k 311 replica_monitor PRI 17:17:43
141245 *0 14104|0 0 monitor_1005_minute:0.0% 0|0 177|0 123k 521k 299 replica_monitor PRI 17:17:44
221033 *0 2764|0 0 monitor_1197_minute:0.0% 0|0 163|0 80k 334k 287 replica_monitor PRI 17:17:45
201101 *0 2193|0 0 monitor_1001_minute:0.0% 0|0 156|0 82k 504k 281 replica_monitor PRI 17:17:46
141073 *0 16104|0 0 monitor_1056_minute:0.0% 0|0 154|0 114k 474k 278 replica_monitor PRI 17:17:47
insert query update delete getmore command faults locked db qr|qw ar|aw netIn netOut conn set repl time
211275 *0 2393|0 0 monitor_1022_minute:0.0% 0|0 143|0 83k 425k 267 replica_monitor PRI 17:17:48
201126 *0 25110|0 0 monitor_1131_minute:0.0% 0|0 139|0 91k 523k 261 replica_monitor PRI 17:17:49
15951 *0 691|0 0 monitor_1036_minute:0.0% 0|0 130|0 54k 322k 252 replica_monitor PRI 17:17:51
161107 *0 23118|0 0 monitor_1113_minute:0.0% 0|0 131|0 152k 804k 257 replica_monitor PRI 17:17:52
181152 *0 20131|0 0 monitor_1125_minute:0.0% 0|0 130|0 72k 375k 254 replica_monitor PRI 17:17:53
22962 *0 1975|0 0 monitor_1316_minute:0.0% 0|0 117|0 61k 323k 236 replica_monitor PRI 17:17:54
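2. This is simplified output of the currentOp command, printing the item.op, item.secs_running, item.client, item.desc and item.ns fields. It shows that many queries run for a long time. 10.1.16.223 is this machine and 10.1.16.28 is a secondary. Only operations that have been running for more than 1 second are printed.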
replica_monitor:PRIMARY> db.currentOp().inprog.forEach(function(item){if(item.secs_running>1){print(item.op,item.secs_running,item.client,item.desc,item.ns);}})
query 2 10.1.16.28:55143 conn533341052 monitor_1219_minute.diy_10_1_137_186
query 4 10.1.16.223:13316 conn533340660 monitor_1093_minute.col_server
query 4 10.1.16.223:13367 conn533340690 monitor_1178_minute.col_server
query 2 10.1.16.223:13553 conn533340935 monitor_1226_minute.diy_10_1_1_227
query 2 10.1.16.28:55254 conn533341125 monitor_1261_minute.diy_10_1_136_199
query 5 10.1.16.223:13034 conn533340328 monitor_1131_minute.col_10_1_137_196
query 4 10.1.16.223:13345 conn533340676 monitor_1146_minute.col_server
query 2 10.1.16.28:54989 conn533340916 monitor_1075_minute.col_server
query 7 10.1.16.28:53040 conn533339313 monitor_1056_minute.col_10_1_2_134
query 4 10.1.16.223:13320 conn533340663 monitor_1017_minute.col_server
query 2 10.1.16.223:13824 conn533341185 monitor_1131_minute.col_10_1_115_129
query 2 10.1.16.223:13579 conn533340952 monitor_1237_minute.diy_10_1_18_33
query 5 10.1.16.28:53729 conn533339516 monitor_1434_minute.col_10_1_112_37
query 3 10.1.16.28:54891 conn533340771 monitor_1209_minute.col_10_1_17_123
query 4 10.1.16.223:13364 conn533340687 monitor_1169_minute.col_server
query 2 10.1.16.223:13741 conn533341103 monitor_1271_minute.col_10_1_16_109
query 5 10.1.16.28:53426 conn533339973 monitor_1131_minute.col_10_1_137_196
query 3 10.1.16.28:54987 conn533340914 monitor_1013_minute.col_server
query 3 10.1.16.28:53490 conn533339992 monitor_1342_minute.col_10_1_113_35
query 5 10.1.16.28:53745 conn533340486 monitor_1446_minute.col_10_1_3_61
query 3 10.1.16.28:54885 conn533340768 monitor_1204_minute.col_10_1_114_102
query 4 10.1.16.223:13359 conn533340682 monitor_1160_minute.col_server
query 3 10.1.16.28:54984 conn533340911 monitor_1003_minute.col_server
query 2 10.1.16.223:13732 conn533341096 monitor_1261_minute.col_10_1_114_102
query 3 10.1.16.28:54973 conn533340900 monitor_1113_minute.col_server
query 4 10.1.16.223:13165 conn533340559 monitor_1367_minute.col_10_1_137_67
query 3 10.1.16.28:54979 conn533340906 monitor_1004_minute.col_server
query 4 10.1.16.223:13350 conn533340679 monitor_1139_minute.col_server
query 3 10.1.16.28:54971 conn533340898 monitor_1120_minute.col_server
query 4 10.1.16.223:13311 conn533340655 monitor_1140_minute.col_server
query 2 10.1.16.28:55039 conn533340980 monitor_1169_minute.diy_10_1_19_99
query 8 10.1.16.223:12862 conn533340167 monitor_1204_minute.col_10_1_114_105
query 3 10.1.16.28:53129 conn533339357 monitor_1200_minute.col_10_1_137_144
query 3 10.1.16.223:13224 conn533340585 monitor_1185_minute.col_10_1_137_117
query 3 10.1.16.223:13067 conn533340351 monitor_1339_minute.col_10_1_168_182
query 4 10.1.16.223:13310 conn533340654 monitor_1120_minute.col_server
query 3 10.1.16.28:54983 conn533340910 monitor_1136_minute.col_server
query 4 10.1.16.223:13326 conn533340667 monitor_1003_minute.col_server
query 3 10.1.16.28:53178 conn533339383 monitor_1226_minute.diy_10_1_18_119
query 3 10.1.16.28:54969 conn533340896 monitor_1036_minute.col_server
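A listing like this is easier to read once it is grouped. The sketch below is one possible way to count long-running operations per collection name; it is plain JavaScript, and the function name summariseSlowOps and the grouping rule (drop the database part of ns) are my own assumptions, not something from the system above. The sample entries are five of the lines printed above.

```javascript
// Group long-running operations by collection name and count them.
// `ops` is an array shaped like db.currentOp().inprog; only the
// secs_running and ns fields are read.
function summariseSlowOps(ops, minSecs) {
  var groups = {};
  ops.forEach(function (item) {
    if (!(item.secs_running > minSecs)) {
      return; // skip idle or short-lived operations
    }
    // ns is "<database>.<collection>"; keep only the collection part.
    var ns = item.ns || "";
    var key = ns.slice(ns.indexOf(".") + 1);
    if (!groups[key]) {
      groups[key] = { key: key, count: 0, maxSecs: 0 };
    }
    groups[key].count += 1;
    groups[key].maxSecs = Math.max(groups[key].maxSecs, item.secs_running);
  });
  // Most frequent collection first; ties broken by the longest runtime.
  return Object.keys(groups)
    .map(function (k) { return groups[k]; })
    .sort(function (a, b) {
      return b.count - a.count || b.maxSecs - a.maxSecs;
    });
}

// Five entries copied from the currentOp output above.
var sample = [
  { op: "query", secs_running: 4, ns: "monitor_1093_minute.col_server" },
  { op: "query", secs_running: 4, ns: "monitor_1178_minute.col_server" },
  { op: "query", secs_running: 5, ns: "monitor_1131_minute.col_10_1_137_196" },
  { op: "query", secs_running: 4, ns: "monitor_1146_minute.col_server" },
  { op: "query", secs_running: 5, ns: "monitor_1131_minute.col_10_1_137_196" }
];
var summary = summariseSlowOps(sample, 1);
```

In the mongo shell the same function could presumably be fed db.currentOp().inprog directly. If one collection name such as col_server dominates the result, that is probably the query worth examining first.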
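I am not familiar with mongo either and do not know where to start in order to pinpoint the problem. Any guidance from the experts would be appreciated.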
翻阅古今
Views: 457, Answers: 2
2 Answers

GCT1015

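1. First, compare against the historical situation: has the traffic volume or the performance changed significantly?
2. Judging from your CPU monitoring, it looks mainly like a shortage of CPU resources. Could you consider adding CPU resources, or directing some of the queries to the other members of the replica set?
3. I do not see anything wrong with the indexes so far; specific query statements would need to be analyzed separately.
For reference! Love MongoDB! Have fun!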
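For point 3, one way to analyze a specific query is to run it with explain() and compare how much was scanned with how much was returned. The helper below is only a sketch: it assumes the older flat explain format (cursor, n, nscanned, millis), which may not match your server version, and the name assessLegacyExplain and the ratio threshold of 10 are hypothetical. The two sample documents are made up for illustration, not taken from your system.

```javascript
// Classify an old-style explain() document.
// Assumes the flat format with cursor / n / nscanned / millis fields.
function assessLegacyExplain(plan, maxScanRatio) {
  var limit = maxScanRatio || 10; // hypothetical threshold
  // Entries scanned per document returned; guard against n === 0.
  var ratio = plan.nscanned / Math.max(plan.n, 1);
  var verdict;
  if (plan.cursor === "BasicCursor") {
    verdict = "collection-scan"; // no index was used at all
  } else if (ratio > limit) {
    verdict = "inefficient-index"; // index used, but it filters poorly
  } else {
    verdict = "ok";
  }
  return { verdict: verdict, scanRatio: ratio, millis: plan.millis };
}

// Hypothetical explain documents for illustration only.
var fullScan = { cursor: "BasicCursor", n: 20, nscanned: 500000, millis: 3100 };
var indexed = { cursor: "BtreeCursor ts_1", n: 20, nscanned: 20, millis: 1 };

var fullScanResult = assessLegacyExplain(fullScan);
var indexedResult = assessLegacyExplain(indexed);
```

In the shell, the input would come from something like db.getSiblingDB("monitor_1120_minute").col_server.find(...).explain(), with the filter the application actually sends in place of the dots. A collection-scan verdict on a frequently issued query would suggest that a missing index, rather than a lack of CPU, is the first thing to fix.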