mongodb - Why every few minutes activity on local.oplog.rs locks mongo clients -


the problem:

every 1 or 2 minutes, mongo clients halt 3 seconds. normal operation time updates 1 or 2 milliseconds. when slowness appear, have bunch of updates lasting 1 3 seconds.

the slow queries log not show related this. neither debugging mongo client (mongo-php-client).

the current architecture has 1 master, 1 slave , 1 arbiter in replica set.

the queries executed same sort (upsert _id, insert new mongoid). there no "every few minutes run super-expensive update"

the blocking seems caused local.oplog.rs. @ least, mongotop output shows below. haven't found indication secondary causing issue, outputs of following commands seem stable there. haven't found information pointing slowness caused specific query either.

the idea behind way store data pre-aggregated reports. have lot of updates (couple of hundreds per second), low rate of queries. indexes date bound (except _id, calculated composite key based on dimensions record contains. whit this, mean _id not incremental, objectid index). give idea, indexes in biggest collection (in mb):

"indexsizes" : {     "_id_" : 1967,     "owner_date" : 230,     "flow_date" : 231,     "date" : 170 }, 

most of other collections have indexes of size of 100mb or less. in collections, _id index biggest one. worth noting ids generated manually (based on metadata insertion done upsert, , not incremental)

followup:

edit1: after more digging, seems locking related rsync process of journal, when flushing disk. journal in same filesystem data files, slowness in not expected disks fast ssd devices.

edit2: after testings, writing capability of disks not issue. writing @ rate of several megabytes per second. running tests on on disk accepts 150mb/s without problem

the expected answer:

why downtime happening?

  • pointers possible reasons, in order investigate further
  • experience/solution based on similar cases

issue explanation:

the following commands run in primary node

every time slowness appears, see following in mongostat (two examples)

insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netin netout conn set repl     time     10    *0    141     *0      93   120|0     0.4   80.0       0 11.5g 9.4g   0|1   1|0  110k    71k  274 rs0  pri 13:15:44     12    *0    178     *0      72   105|0     0.2   80.0       1 11.5g 9.4g   0|0   1|0  111k    79k  274 rs0  pri 13:15:45     47    *0      7     *0       0   159|0     0.1   80.0       0 11.5g 9.4g   0|0   2|1   15k    44k  274 rs0  pri 13:15:49 !!!!here     14    *0    929     *0      99   170|0     0.2   80.0       0 11.5g 9.4g   0|0   1|0  419k   417k  274 rs0  pri 13:15:50     21    *0    287     *0     124   181|0     0.2   80.0       0 11.5g 9.4g   0|0   1|0  187k   122k  274 rs0  pri 13:15:51  insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netin netout conn set repl     time     10    *0    145     *0      70   108|0     0.3   79.9       0 11.5g 9.4g   0|0   1|0   98k    71k  274 rs0  pri 13:16:48     11    *0    155     *0      72   111|0     0.2   79.9       1 11.5g 9.4g   0|0   1|0  103k    75k  274 rs0  pri 13:16:49     44    *0      3     *0       0   144|0     0.0   79.9       0 11.5g 9.4g   0|2   1|0   11k    75k  274 rs0  pri 13:16:53 !!!!here     11    *0    837     *0      94   134|0     0.2   79.9       0 11.5g 9.4g   0|0   1|0  377k   348k  274 rs0  pri 13:16:54     12    *0    180     *0      86   139|0     0.2   79.9       0 11.5g 9.4g   0|0   1|0  122k    85k  274 rs0  pri 13:16:55     14    *0    195     *0      83   124|0     0.2   79.9       0 11.5g 9.4g   0|0   2|0  125k    89k  274 rs0  pri 13:16:56 

the update column has drop, , following 1 more updates. note using mongostat 1 second delay. when slowness appears, mongostat stops replying few seconds. stop present in master, not in slave server

this output of mongotop when problem happens (at 2015-07-07t13:29:38):

(an example bit more of context can found here)

                                                                                        ns    total     read    write    2015-07-07t13:29:33+02:00                                                            database_name.d_date_flow_owner    555ms    550ms      4ms                                                                                                          local.oplog.rs     61ms     53ms      7ms                                                                                    database_name.client_context_bbbbbbb     15ms      0ms     15ms                                                                                      database_name.d_date_landing_owner     15ms      0ms     15ms                                                                                      database_name.d_date_billing_owner     10ms      0ms     10ms                              database_name.w_bl_carrier_country_date_device_flow_landing_manager_op1_os_owner_prod_site      7ms      0ms      7ms                                              database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site      5ms      0ms      5ms                                                                                      database_name.d_country_date_owner      5ms      0ms      5ms                                                                                       database_name.d_date_device_owner      5ms      0ms      5ms                                                                                           database_name.d_date_os_owner      5ms      0ms      5ms                                                                         ns    total    read    write    2015-07-07t13:29:37+02:00         database_name.client_context_bbbbbbb      2ms     0ms      2ms                                    database_name.client_context_aaaaaaaaa      1ms     0ms      1ms                                                 admin.system.backup_users      0ms     0ms      0ms                                                   admin.system.namespaces      0ms     0ms      0ms                                                    admin.system.new_users      0ms     0ms      0ms                                                      admin.system.profile      0ms     0ms      0ms                                                        admin.system.roles      0ms     0ms      0ms                                                        admin.system.users      0ms     0ms      0ms                                                      admin.system.version      0ms     0ms      0ms                                                             database_name      0ms     0ms      0ms                                                                                                       ns     total      read     write    2015-07-07t13:29:38+02:00                                                             local.oplog.rs    8171ms    4470ms    3701ms                                                                          database_name.d_date_op1_owner      45ms       0ms      45ms                                                                       database_name.d_date_device_owner      39ms       0ms      39ms                                                                              database_name.m_date_owner      34ms       0ms      34ms                                                                              database_name.d_date_owner      32ms       0ms      32ms                                                                         database_name.d_date_owner_site      31ms       0ms      31ms                                                                      database_name.d_carrier_date_owner      30ms       0ms      30ms                                                                         database_name.d_date_flow_owner      30ms       0ms      30ms                                                                      database_name.d_date_owner_product      28ms       0ms      28ms                              database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site      27ms       0ms      27ms                                                                                                                       ns    total     read    write    2015-07-07t13:29:39+02:00                                                            database_name.d_date_flow_owner    558ms    552ms      6ms                                                                                                          local.oplog.rs     62ms     61ms      1ms                                                                                      database_name.d_carrier_date_owner     17ms      0ms     17ms                                                                                              database_name.d_date_owner     16ms      0ms     16ms                              database_name.w_bl_carrier_country_date_device_flow_landing_manager_op1_os_owner_prod_site      7ms      0ms      7ms                                                                                      database_name.d_date_billing_owner      6ms      0ms      6ms                                              database_name.d_carrier_country_date_device_flow_landing_op1_os_owner_site      5ms      0ms      5ms                                                                                      database_name.d_country_date_owner      5ms      0ms      5ms                                                                                       database_name.d_date_device_owner      5ms      0ms      5ms                                                                                          database_name.d_date_op1_owner      5ms      0ms      5ms                              

debug of php mongo client shows issue (the last 2 "phpmongoclient debug" lines):

(an example bit more of context can found here)

update duration: 2ms update duration: 1ms update duration: 1ms update duration: 4006ms     phpmongoclient debug: 2015-07-07 10:40:26 -  parse (info): parsing mongodb://primary_host.lan,secondary_host.lan     phpmongoclient debug: 2015-07-07 10:40:26 -  parse (info): - found node: primary_host.lan:27017     [....]     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (fine): limiting credentials: done     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (fine): sorting servers priority , ping time     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (fine): - connection: type: primary, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (fine): sorting servers: done     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (fine): selecting near servers     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (fine): selecting near servers: nearest 0ms     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (fine): - connection: type: primary, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (fine): selecting near server: done     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (info): pick server: random element 0     phpmongoclient debug: 2015-07-07 10:40:26 -  replset (info): - connection: type: primary, socket: 42, ping: 0, hash: primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751     phpmongoclient debug: 2015-07-07 10:40:26 -  con (fine): no timeout changes primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751     phpmongoclient debug: 2015-07-07 10:40:30 -  con (fine): no timeout changes primary_host.lan:27017;rs0;database_name/user/5ca571e7db198eeee3abee35857bfd53;30751 update duration: 3943ms update duration: 3476ms update duration: 2008ms update duration: 961ms update duration: 956ms update duration: 20ms update duration: 20ms update duration: 3ms update duration: 42ms update duration: 24ms update duration: 25ms update duration: 56ms update duration: 24ms update duration: 11ms update duration: 11ms update duration: 3ms update duration: 2ms update duration: 3ms update duration: 1ms update duration: 1ms update duration: 1ms update duration: 2ms 

mongo information:

  • mongo version: 3.0.3
  • replica set 1 slave , 1 arbiter
  • replication lag varies between 0 , 4 seconds
  • engine: wiredtiger
  • filesystem: xfs
  • operative system: red hat enterprise linux server release 7.1
  • memory: 24gb. reported htop 40% used, 60% cache

this issue has disappeared. 2 actions undertaken:

  1. reworked pre-aggregated reports system. workload of mongo reduced 10 times.
  2. updated version of mongo 3.0.6

unfortunately, 2 changes put online no time in between them. have suspicion reducing workload did trick (which might or might not linked issue @steve-brisk pointed out) @ least (we'll see when hit again previous workload levels). but, version updated, might previous levels of workload might not hit issue again

i have no evidence pointing 1 of 2 solutions alone. after taking both of them, issue resolved


Comments

Popular posts from this blog

How to provide Authorization & Authentication using Asp.net, C#? -

toolbar - How to add link to user registration inside toobar in admin joomla 3 custom component -

How to use Authorization & Authentication in Asp.net, C#? -