mongodb - Why every few minutes activity on local.oplog.rs locks mongo clients -
the problem:
every 1 or 2 minutes, mongo clients halt 3 seconds. normal operation time updates 1 or 2 milliseconds. when slowness appear, have bunch of updates lasting 1 3 seconds.
the slow queries log not show related this. neither debugging mongo client (mongo-php-client).
the current architecture has 1 master, 1 slave , 1 arbiter in replica set.
the queries executed same sort (upsert _id, insert new mongoid). there no "every few minutes run super-expensive update"
the blocking seems caused local.oplog.rs. @ least, mongotop output shows below. haven't found indication secondary causing issue, outputs of following commands seem stable there. haven't found information pointing slowness caused specific query either.
the idea behind way store data pre-aggregated reports. have lot of updates (couple of hundreds per second), low rate of queries. indexes date bound (except _id, calculated composite key based on dimensions record contains. whit this, mean _id not incremental, objectid index). give idea, indexes in biggest collection (in mb):
"indexsizes" : { "_id_" : 1967, "owner_date" : 230, "flow_date" : 231, "date" : 170 }, most of other collections have indexes of size of 100mb or less. in collections, _id index biggest one. worth noting ids generated manually (based on metadata insertion done upsert, , not incremental)
followup:
edit1: after more digging, seems locking related rsync process of journal, when flushing disk. journal in same filesystem data files, slowness in not expected disks fast ssd devices.
edit2: after testings, writing capability of disks not issue. writing @ rate of several megabytes per second. running tests on on disk accepts 150mb/s without problem
the expected answer:
why downtime happening?
- pointers possible reasons, in order investigate further
- experience/solution based on similar cases
issue explanation:
the following commands run in primary node
every time slowness appears, see following in mongostat (two examples)
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netin netout conn set repl time 10 *0 141 *0 93 120|0 0.4 80.0 0 11.5g 9.4g 0|1 1|0 110k 71k 274 rs0 pri 13:15:44 12 *0 178 *0 72 105|0 0.2 80.0 1 11.5g 9.4g 0|0 1|0 111k 79k 274 rs0 pri 13:15:45 47 *0 7 *0 0 159|0 0.1 80.0 0 11.5g 9.4g 0|0 2|1 15k 44k 274 rs0 pri 13:15:49 !!!!here 14 *0 929 *0 99 170|0 0.2 80.0 0 11.5g 9.4g 0|0 1|0 419k 417k 274 rs0 pri 13:15:50 21 *0 287 *0 124 181|0 0.2 80.0 0 11.5g 9.4g 0|0 1|0 187k 122k 274 rs0 pri 13:15:51 insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netin netout conn set repl time 10 *0 145 *0 70 108|0 0.3 79.9 0 11.5g 9.4g 0|0 1|0 98k 71k 274 rs0 pri 13:16:48 11 *0 155 *0 72 111|0 0.2 79.9 1 11.5g 9.4g 0|0 1|0 103k 75k 274 rs0 pri 13:16:49 44 *0 3 *0 0 144|0 0.0 79.9 0 11.5g 9.4g 0|2 1|0 11k 75k 274 rs0 pri 13:16:53 !!!!here 11 *0 837 *0 94 134|0 0.2 79.9 0 11.5g 9.4g 0|0 1|0 377k 348k 274 rs0 pri 13:16:54 12 *0 180 *0 86 139|0 0.2 79.9 0 11.5g 9.4g 0|0 1|0 122k 85k 274 rs0 pri 13:16:55 14 *0 195 *0 83 124|0 0.2 79.9 0 11.5g 9.4g 0|0 2|0 125k 89k 274 rs0 pri 13:16:56 the update column has drop, , following 1 more updates. note using mongostat 1 second delay. when slowness appears, mongostat stops replying few seconds. stop present in master, not in slave server
this output of mongotop when problem happens (at 2015-07-07t13:29:38):
(an example bit more of context can found here)
ns total read write 2015-07-07t13:29:33+02:00 database_name.d_date_flow_owner 555ms 550ms 4ms local.oplog.rs 61ms 53ms 7ms database_name.client_context_bbbbbbb 15ms 0ms 15ms database_name.d_date_landing_owner 15ms 0ms 15ms database_name.d_date_billing_owner 10ms 0ms 10ms database_name.w_bl_carrier_country_date_device_flow_landing_manager_op1_os_owner_prod_site 7ms 0ms 7ms database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site 5ms 0ms 5ms database_name.d_country_date_owner 5ms 0ms 5ms database_name.d_date_device_owner 5ms 0ms 5ms database_name.d_date_os_owner 5ms 0ms 5ms ns total read write 2015-07-07t13:29:37+02:00 database_name.client_context_bbbbbbb 2ms 0ms 2ms database_name.client_context_aaaaaaaaa 1ms 0ms 1ms admin.system.backup_users 0ms 0ms 0ms admin.system.namespaces 0ms 0ms 0ms admin.system.new_users 0ms 0ms 0ms admin.system.profile 0ms 0ms 0ms admin.system.roles 0ms 0ms 0ms admin.system.users 0ms 0ms 0ms admin.system.version 0ms 0ms 0ms database_name 0ms 0ms 0ms ns total read write 2015-07-07t13:29:38+02:00 local.oplog.rs 8171ms 4470ms 3701ms database_name.d_date_op1_owner 45ms 0ms 45ms database_name.d_date_device_owner 39ms 0ms 39ms database_name.m_date_owner 34ms 0ms 34ms database_name.d_date_owner 32ms 0ms 32ms database_name.d_date_owner_site 31ms 0ms 31ms database_name.d_carrier_date_owner 30ms 0ms 30ms database_name.d_date_flow_owner 30ms 0ms 30ms database_name.d_date_owner_product 28ms 0ms 28ms database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site 27ms 0ms 27ms ns total read write 2015-07-07t13:29:39+02:00 database_name.d_date_flow_owner 558ms 552ms 6ms local.oplog.rs 62ms 61ms 1ms database_name.d_carrier_date_owner 17ms 0ms 17ms database_name.d_date_owner 16ms 0ms 16ms database_name.w_bl_carrier_country_date_device_flow_landing_manager_op1_os_owner_prod_site 7ms 0ms 7ms database_name.d_date_billing_owner 6ms 0ms 6ms database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site 5ms 0ms 5ms database_name.d_country_date_owner 5ms 0ms 5ms database_name.d_date_device_owner 5ms 0ms 5ms database_name.d_date_op1_owner 5ms 0ms 5ms debug of php mongo client shows issue (the last 2 "phpmongoclient debug" lines):
(an example bit more of context can found here)
update duration: 2ms update duration: 1ms update duration: 1ms update duration: 4006ms phpmongoclient debug: 2015-07-07 10:40:26 - parse (info): parsing mongodb://primary_host.lan,secondary_host.lan phpmongoclient debug: 2015-07-07 10:40:26 - parse (info): - found node: primary_host.lan:27017 [....] phpmongoclient debug: 2015-07-07 10:40:26 - replset (fine): limiting credentials: done phpmongoclient debug: 2015-07-07 10:40:26 - replset (fine): sorting servers priority , ping time phpmongoclient debug: 2015-07-07 10:40:26 - replset (fine): - connection: type: primary, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751 phpmongoclient debug: 2015-07-07 10:40:26 - replset (fine): sorting servers: done phpmongoclient debug: 2015-07-07 10:40:26 - replset (fine): selecting near servers phpmongoclient debug: 2015-07-07 10:40:26 - replset (fine): selecting near servers: nearest 0ms phpmongoclient debug: 2015-07-07 10:40:26 - replset (fine): - connection: type: primary, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751 phpmongoclient debug: 2015-07-07 10:40:26 - replset (fine): selecting near server: done phpmongoclient debug: 2015-07-07 10:40:26 - replset (info): pick server: random element 0 phpmongoclient debug: 2015-07-07 10:40:26 - replset (info): - connection: type: primary, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751 phpmongoclient debug: 2015-07-07 10:40:26 - con (fine): no timeout changes primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751 phpmongoclient debug: 2015-07-07 10:40:30 - con (fine): no timeout changes primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751 update duration: 3943ms update duration: 3476ms update duration: 2008ms update duration: 961ms update duration: 956ms update duration: 20ms update duration: 20ms update duration: 3ms update duration: 42ms update duration: 24ms update duration: 25ms update duration: 56ms update duration: 24ms update duration: 11ms update duration: 11ms update duration: 3ms update duration: 2ms update duration: 3ms update duration: 1ms update duration: 1ms update duration: 1ms update duration: 2ms mongo information:
- mongo version: 3.0.3
- replica set 1 slave , 1 arbiter
- replication lag varies between 0 , 4 seconds
- engine: wiredtiger
- filesystem: xfs
- operative system: red hat enterprise linux server release 7.1
- memory: 24gb. reported htop 40% used, 60% cache
this issue has disappeared. 2 actions undertaken:
- reworked pre-aggregated reports system. workload of mongo reduced 10 times.
- updated version of mongo 3.0.6
unfortunately, 2 changes put online no time in between them. have suspicion reducing workload did trick (which might or might not linked issue @steve-brisk pointed out) @ least (we'll see when hit again previous workload levels). but, version updated, might previous levels of workload might not hit issue again
i have no evidence pointing 1 of 2 solutions alone. after taking both of them, issue resolved
Comments
Post a Comment