Hi experts, I recently took over a system at my company that turns out to be a real mess: a monitoring system backed by MongoDB that monitors the interfaces of the company's entire platform and receives the data reported by each business service. Overall the reporting rate is low, under 100 QPS, but the machine load is frighteningly high: mongod uses 98~99.5% of CPU time, and has done so for a long time!

top - 17:33:14 up 204 days, 14:37, 2 users, load average: 89.99, 96.20, 100.96
Tasks: 389 total, 1 running, 386 sleeping, 0 stopped, 2 zombie
Cpu(s): 94.3%us, 3.3%sy, 0.0%ni, 2.1%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 65855664k total, 60083264k used, 5772400k free, 169764k buffers
Swap: 8388600k total, 0k used, 8388600k free, 23005656k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
59181 root 20 0 114g 33g 8264 S 2348.1 53.5 34480,54 mongod

I have no idea how my predecessors deployed it. I am not familiar with MongoDB and have not found the problem yet; my only guess is that there are too many queries and they are too slow, which puts the server under enormous pressure. In addition, mongostat in the production version has no idx miss % column, so it is hard to tell whether the indexes are well designed. Here is some data:

1. mongostat output. It shows both reads and writes. There are fewer writes than queries and the writes are not blocked, but the queries are badly backed up.

insert query update delete getmore command faults locked db qr|qw ar|aw netIn netOut conn set repl time
572183 *0 59108|0 0 monitor_1120_minute:0.0% 0|0 194|0 984k 1m 318 replica_monitor PRI 17:17:37
261354 *0 2988|0 0 monitor_1016_minute:0.0% 0|0 192|0 453k 778k 320 replica_monitor PRI 17:17:38
621734 *0 66135|0 0 monitor_1008_minute:0.0% 0|0 195|0 535k 829k 318 replica_monitor PRI 17:17:39
481681 *0 44175|0 0 monitor_1261_minute:0.0% 0|0 192|0 4m 5m 316 replica_monitor PRI 17:17:41
327972 *0 34212|0 0 monitor_1204_minute:0.0% 0|0 185|0 354k 654k 312 replica_monitor PRI 17:17:42
254527 *0 29101|0 0 monitor_1204_minute:0.0% 0|0 185|0 182k 608k 311 replica_monitor PRI 17:17:43
141245 *0 14104|0 0 monitor_1005_minute:0.0% 0|0 177|0 123k 521k 299 replica_monitor PRI 17:17:44
221033 *0 2764|0 0 monitor_1197_minute:0.0% 0|0 163|0 80k 334k 287 replica_monitor PRI 17:17:45
201101 *0 2193|0 0 monitor_1001_minute:0.0% 0|0 156|0 82k 504k 281 replica_monitor PRI 17:17:46
141073 *0 16104|0 0 monitor_1056_minute:0.0% 0|0 154|0 114k 474k 278 replica_monitor PRI 17:17:47
insert query update delete getmore command faults locked db qr|qw ar|aw netIn netOut conn set repl time
211275 *0 2393|0 0 monitor_1022_minute:0.0% 0|0 143|0 83k 425k 267 replica_monitor PRI 17:17:48
201126 *0 25110|0 0 monitor_1131_minute:0.0% 0|0 139|0 91k 523k 261 replica_monitor PRI 17:17:49
15951 *0 691|0 0 monitor_1036_minute:0.0% 0|0 130|0 54k 322k 252 replica_monitor PRI 17:17:51
161107 *0 23118|0 0 monitor_1113_minute:0.0% 0|0 131|0 152k 804k 257 replica_monitor PRI 17:17:52
181152 *0 20131|0 0 monitor_1125_minute:0.0% 0|0 130|0 72k 375k 254 replica_monitor PRI 17:17:53
22962 *0 1975|0 0 monitor_1316_minute:0.0% 0|0 117|0 61k 323k 236 replica_monitor PRI 17:17:54

2. This is a trimmed-down currentOp output that prints only the item.op, item.secs_running, item.client, item.desc and item.ns fields. You can see that many queries run for a long time. 10.1.16.223 is the local machine and 10.1.16.28 is a secondary. Only operations that have been running for more than 1 second are printed.

replica_monitor:PRIMARY> db.currentOp().inprog.forEach(function(item){if(item.secs_running>1){print(item.op,item.secs_running,item.client,item.desc,item.ns);}})
query 2 10.1.16.28:55143 conn533341052 monitor_1219_minute.diy_10_1_137_186
query 4 10.1.16.223:13316 conn533340660 monitor_1093_minute.col_server
query 4 10.1.16.223:13367 conn533340690 monitor_1178_minute.col_server
query 2 10.1.16.223:13553 conn533340935 monitor_1226_minute.diy_10_1_1_227
query 2 10.1.16.28:55254 conn533341125 monitor_1261_minute.diy_10_1_136_199
query 5 10.1.16.223:13034 conn533340328 monitor_1131_minute.col_10_1_137_196
query 4 10.1.16.223:13345 conn533340676 monitor_1146_minute.col_server
query 2 10.1.16.28:54989 conn533340916 monitor_1075_minute.col_server
query 7 10.1.16.28:53040 conn533339313 monitor_1056_minute.col_10_1_2_134
query 4 10.1.16.223:13320 conn533340663 monitor_1017_minute.col_server
query 2 10.1.16.223:13824 conn533341185 monitor_1131_minute.col_10_1_115_129
query 2 10.1.16.223:13579 conn533340952 monitor_1237_minute.diy_10_1_18_33
query 5 10.1.16.28:53729 conn533339516 monitor_1434_minute.col_10_1_112_37
query 3 10.1.16.28:54891 conn533340771 monitor_1209_minute.col_10_1_17_123
query 4 10.1.16.223:13364 conn533340687 monitor_1169_minute.col_server
query 2 10.1.16.223:13741 conn533341103 monitor_1271_minute.col_10_1_16_109
query 5 10.1.16.28:53426 conn533339973 monitor_1131_minute.col_10_1_137_196
query 3 10.1.16.28:54987 conn533340914 monitor_1013_minute.col_server
query 3 10.1.16.28:53490 conn533339992 monitor_1342_minute.col_10_1_113_35
query 5 10.1.16.28:53745 conn533340486 monitor_1446_minute.col_10_1_3_61
query 3 10.1.16.28:54885 conn533340768 monitor_1204_minute.col_10_1_114_102
query 4 10.1.16.223:13359 conn533340682 monitor_1160_minute.col_server
query 3 10.1.16.28:54984 conn533340911 monitor_1003_minute.col_server
query 2 10.1.16.223:13732 conn533341096 monitor_1261_minute.col_10_1_114_102
query 3 10.1.16.28:54973 conn533340900 monitor_1113_minute.col_server
query 4 10.1.16.223:13165 conn533340559 monitor_1367_minute.col_10_1_137_67
query 3 10.1.16.28:54979 conn533340906 monitor_1004_minute.col_server
query 4 10.1.16.223:13350 conn533340679 monitor_1139_minute.col_server
query 3 10.1.16.28:54971 conn533340898 monitor_1120_minute.col_server
query 4 10.1.16.223:13311 conn533340655 monitor_1140_minute.col_server
query 2 10.1.16.28:55039 conn533340980 monitor_1169_minute.diy_10_1_19_99
query 8 10.1.16.223:12862 conn533340167 monitor_1204_minute.col_10_1_114_105
query 3 10.1.16.28:53129 conn533339357 monitor_1200_minute.col_10_1_137_144
query 3 10.1.16.223:13224 conn533340585 monitor_1185_minute.col_10_1_137_117
query 3 10.1.16.223:13067 conn533340351 monitor_1339_minute.col_10_1_168_182
query 4 10.1.16.223:13310 conn533340654 monitor_1120_minute.col_server
query 3 10.1.16.28:54983 conn533340910 monitor_1136_minute.col_server
query 4 10.1.16.223:13326 conn533340667 monitor_1003_minute.col_server
query 3 10.1.16.28:53178 conn533339383 monitor_1226_minute.diy_10_1_18_119
query 3 10.1.16.28:54969 conn533340896 monitor_1036_minute.col_server

I am not familiar with MongoDB either and do not know where to start in order to pinpoint the problem. Any guidance would be much appreciated.
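
The currentOp listing above is long, and the same collection names (for example col_server) recur across many monitor_*_minute databases. One way to make it easier to read might be to group the slow operations by collection name. The following is only a minimal sketch: it assumes entries shaped like the elements of db.currentOp().inprog, the helper name summarizeSlowOps and the sampleOps array are made up for illustration (the sample rows are copied from the listing above), and it is written in plain ES5 so that it should also load in an older mongo shell.

```javascript
// Hypothetical helper (not part of MongoDB): groups operations that have
// been running longer than minSecs by collection name, i.e. the part of
// `ns` after the first dot, and returns the groups sorted by total runtime.
function summarizeSlowOps(inprog, minSecs) {
  var byCollection = {};
  inprog.forEach(function (item) {
    // Skip fast operations and entries without a namespace.
    if (!(item.secs_running > minSecs) || !item.ns) {
      return;
    }
    var collection = item.ns.substring(item.ns.indexOf(".") + 1);
    var entry = byCollection[collection];
    if (!entry) {
      entry = { collection: collection, count: 0, totalSecs: 0, maxSecs: 0 };
      byCollection[collection] = entry;
    }
    entry.count += 1;
    entry.totalSecs += item.secs_running;
    entry.maxSecs = Math.max(entry.maxSecs, item.secs_running);
  });
  return Object.keys(byCollection)
    .map(function (name) { return byCollection[name]; })
    .sort(function (a, b) { return b.totalSecs - a.totalSecs; });
}

// Illustrative input: three rows from the listing above plus one
// invented 1-second entry that the filter should drop.
var sampleOps = [
  { op: "query", secs_running: 4, client: "10.1.16.223:13316", desc: "conn533340660", ns: "monitor_1093_minute.col_server" },
  { op: "query", secs_running: 4, client: "10.1.16.223:13367", desc: "conn533340690", ns: "monitor_1178_minute.col_server" },
  { op: "query", secs_running: 2, client: "10.1.16.28:55143", desc: "conn533341052", ns: "monitor_1219_minute.diy_10_1_137_186" },
  { op: "query", secs_running: 1, client: "10.1.16.223:1", desc: "conn1", ns: "monitor_1000_minute.col_server" }
];

var summary = summarizeSlowOps(sampleOps, 1);
```

In the mongo shell the same function could presumably be called on live data with printjson(summarizeSlowOps(db.currentOp().inprog, 1)). Grouping by collection rather than by full namespace is a design choice for this data set, where each monitored item seems to have its own database.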
GCT1015