Crash every hour in production under load (Windows, 1.8.2)


Braden

Jul 9, 2011, 5:51:04 PM
to mongodb-user
Mongo 1.8.2
2x servers running on windows 2008 with 24gb ram

Getting a crash about every hour

Sat Jul 09 14:38:21 [conn1091] MapViewOfFileEx failed d:/data/local/local.ns errno:487 Attempt to access invalid address.
Sat Jul 09 14:38:21 [conn1091] [Our Collection] Assertion failure p db\mongommf.cpp 198
Sat Jul 09 14:38:21 [conn1091] [Our Collection] query: { uid: ObjectId('4e18a2be00965b3f28ebfb5d') } exception 0 assertion db\mongommf.cpp:198 494676ms


We are also experiencing a strange problem under load where Windows
reports 15+ GB of RAM used for cache while mongostat shows ~2 GB res.
It feels like it could be related, as mongo is probably under memory
pressure.

We are under load: 2000 queries and 30 inserts per second.

Scott Hernandez

Jul 9, 2011, 6:26:25 PM
to mongod...@googlegroups.com
Can you run with --vvvvv so there is more logging output?

On Sat, Jul 9, 2011 at 5:51 PM, Braden <xzinv...@gmail.com> wrote:
> Mongo 1.8.2
> 2x servers running on windows 2008 with 24gb ram

What does 2x mean in this context?

> Getting a crash about every hour
>
> Sat Jul 09 14:38:21 [conn1091] MapViewOfFileEx failed d:/data/local/local.ns errno:487 Attempt to access invalid address.
> Sat Jul 09 14:38:21 [conn1091] [Our Collection] Assertion failure p db\mongommf.cpp 198
> Sat Jul 09 14:38:21 [conn1091] [Our Collection] query: { uid: ObjectId('4e18a2be00965b3f28ebfb5d') } exception 0 assertion db\mongommf.cpp:198 494676ms
>

Is it possible that you have corrupt data in your database? Are you
running with journaling?


>
> We are also experiencing a strange problem under load where windows
> claims 15+gb of ram for cache while mongostat shows ~2gb res - feels
> like it could be related as mongo is probably under memory pressure

Also, the cache memory maps to the memory mapped files for the data
and resident memory is other stuff (stack, data structures, etc). They
are not related if that is what you are suggesting.

> We are under load, 2000 queries 30inserts /sec

Can you collect mongostat data as well?


Braden

Jul 9, 2011, 7:02:09 PM
to mongodb-user

We are running with --journal and have been since the start. It has
crashed a bunch of times since, so I am not sure about the consistency
of the database anymore.

We are following another thread now. It looks like we have a rogue
query doing a HUGE update. Since the crash is trying to access the
local db, I wonder if this is oplog related? The local db is 1gb; I
wonder if we should recreate the oplog to be much larger?


Here is our mongostat
insert  query update delete getmore command flushes mapped  vsize    res locked % idx miss %   qr|qw   ar|aw  netIn netOut  conn      set repl     time
    10    507     25      0      10      59       0   150g   300g  5.86g     54.8          0    5|11  100|37    75k     1m   211 wireclub    M 15:57:34
   123    993     65      0      20     146       0   150g   300g  5.59g     52.9          0   49|11  101|35   175k     2m   211 wireclub    M 15:57:35
   111   1110     59      0      21     137       0   150g   300g  5.65g     50.9          0    3|11   96|35   373k     1m   173 wireclub    M 15:57:36
     0    414     19      0      10      38       0   150g   300g  5.75g     12.7          0  183|37  183|37    46k   674k   234 wireclub    M 15:57:37
   109    311     31      0       6      58       0   150g   300g  5.72g     31.7          0   65|27   97|35   199k   701k   234 wireclub    M 15:57:38
   132   1098     75      1      27     178       0   150g   300g  5.51g     58.1          0   72|15  120|36   209k     2m   234 wireclub    M 15:57:39
   131    607     25      1      15      50       0   150g   300g  5.89g       52          0   48|17  100|34   130k     2m   232 wireclub    M 15:57:40
     8    448     19      0      23      54       0   150g   300g     6g     55.5          0   37|26   96|35   106k     2m   232 wireclub    M 15:57:41
   120    812     50      0      17      96       0   150g   300g  6.15g     58.9          0   55|23   95|34   185k     2m   232 wireclub    M 15:57:42
   141   1012     80      1      20     149       0   150g   300g  6.27g     55.7          0   52|23   97|33   180k     2m   231 wireclub    M 15:57:43
     8    622     14      0       5      57       0   150g   300g  6.27g     52.1          0  186|34  186|34    92k   997k   240 wireclub    M 15:57:44
   221   1012     80      0      12     169       0   150g   300g  5.74g     51.6          0   68|37  104|42   230k     1m   240 wireclub    M 15:57:45
    25    462     30      0       5      80       0   150g   300g  5.96g     11.9          0  181|39  182|39    70k   930k   240 wireclub    M 15:57:46
   113    184     18      0       6      30       0   150g   300g  5.93g     17.3          0     5|2  100|33    72k   815k   240 wireclub    M 15:57:47
   154   1062     79      1      27     185       0   150g   300g  5.98g     49.7          0   20|11  107|33   243k    19m   240 wireclub    M 15:57:48
    12   1155     22      0      18      43       0   150g   300g   6.1g     51.8          0     4|8  105|34   149k    21m   240 wireclub    M 15:57:49
   113    769     12      0      14      17       0   150g   300g  6.15g       58          0   65|30  104|33   461k    18m   239 wireclub    M 15:57:50
    26    733     39      0      17      88       0   150g   300g   6.2g     56.7          0     0|1   92|33   234k    17m   239 wireclub    M 15:57:51
   227    952     64      2      19     162       0   150g   300g  6.15g     48.4          0  137|36  137|37   194k     3m   238 wireclub    M 15:57:52
    28   1185     36      1      14      91       0   150g   300g  6.29g     52.8          0   47|15  120|34   166k     2m   238 wireclub    M 15:57:53
   141   1123     55      1      15     117       0   150g   300g   6.2g     53.8          0   85|21   96|33   320k     2m   237 wireclub    M 15:57:54
   112    591     21      0       9      55       0   150g   300g  6.23g     55.8          0   16|13   91|33    99k     1m   237 wireclub    M 15:57:55
   122   1087     47      0      34     111       0   150g   300g  6.42g     54.4          0   26|16  103|33   207k     3m   237 wireclub    M 15:57:56
    12   1780     44      0      31      77       0   150g   300g  6.55g     49.3          0   83|19  105|32   227k     4m   237 wireclub    M 15:57:57
   117    707     57      2      13     103       0   150g   300g  6.19g       54          0   43|22   93|32   215k     3m   237 wireclub    M 15:57:58
     5    291      8      0       4      25       0   150g   300g  6.13g     11.8          0  161|35  162|35    25k   817k   237 wireclub    M 15:57:59
   122    718     60      1      15     149       0   150g   300g  5.87g     45.5          0    8|17  100|34   243k     1m   237 wireclub    M 15:58:00
   114    675     34      0       9      89       0   150g   300g  6.04g     49.9          0   62|13  118|33   120k     2m   238 wireclub    M 15:58:01
   125    903     53      2      18     131       0   150g   300g  6.04g     51.4          0   29|13   99|34   184k     3m   238 wireclub    M 15:58:02
    14    903     28      0       7      73       0   150g   300g  6.28g     49.5          0   65|29  101|33   124k     3m   238 wireclub    M 15:58:03
   135    671     32      1      19      83       0   150g   300g  6.46g     52.3          0     0|1   96|33   613k     3m   238 wireclub    M 15:58:04
   119    740     67      3      10     165       0   150g   300g  6.32g     55.5          0   62|19   96|33   148k     2m   238 wireclub    M 15:58:05
   111    473     26      0       3      62       0   150g   300g  6.53g     51.2          0   68|20   90|32    88k     1m   238 wireclub    M 15:58:06
   145   1226    122      2      28     254       0   150g   300g  6.58g       51          0   51|18  114|33   229k     3m   238 wireclub    M 15:58:07
   140    788     34      1       8      90       0   150g   300g  6.44g     20.7          0  166|32  167|32   118k     1m   238 wireclub    M 15:58:08
     0     40      4      0       0       4       0   150g   300g  6.46g        0          0  169|37  170|36     2k    19k   245 wireclub    M 15:58:09
     0      2      0      0       0       1       0   150g   300g  6.49g        0          0  172|37  172|36    62b     1k   247 wireclub    M 15:58:11
     6    459     32      0      12      60       0   150g   300g  5.77g     39.2          0   61|22   99|33   105k     1m   245 wireclub    M 15:58:12
   241   1324    105      2      16     265       0   150g   300g  5.66g     52.3          0   88|22  116|34   665k     3m   244 wireclub    M 15:58:13
   127    892     62      0      16     129       0   150g   300g  5.83g       52          0   60|17  100|33   596k     3m   253 wireclub    M 15:58:14
   109    789     16      0      19      30       0   150g   300g  5.93g     49.8          0   56|20  100|34   437k     3m   253 wireclub    M 15:58:15
    16    719     34      2      13      96       0   150g   298g  6.02g       51          0   69|29   91|33   141k     2m   253 wireclub    M 15:58:16
   103    460     22      0       6      45       0   150g   300g  6.03g     53.4          0   32|18   93|33    98k     1m   253 wireclub    M 15:58:17
    33    534     33      1      11      70       0   150g   300g   6.4g     46.7          0    13|9   93|37    89k     2m   253 wireclub    M 15:58:18
   228   1193     79      0      20     194       0   150g   300g  6.32g     50.4          0   24|15   96|37   288k     3m   252 wireclub    M 15:58:19
   113    710     44      0      12     118       0   150g   300g  6.16g     57.8          0   68|27  109|35   152k     2m   252 wireclub    M 15:58:20
   119    854     53      0      20     128       0   150g   300g  6.22g     54.1          0  111|32  111|33   218k     2m   252 wireclub    M 15:58:21
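
A dump this long is hard to read by eye, so here is a rough Python sketch of how the samples could be summarized. It assumes each sample has been joined onto a single line of 20 whitespace-separated fields in the column order of the mongostat header in this thread. The names `parse_row` and `summarize` are made up for illustration and are not part of any MongoDB tool.

```python
from statistics import mean

# Column order as printed by the mongostat header in this thread.
COLUMNS = ["insert", "query", "update", "delete", "getmore", "command",
           "flushes", "mapped", "vsize", "res", "locked %", "idx miss %",
           "qr|qw", "ar|aw", "netIn", "netOut", "conn", "set", "repl", "time"]


def parse_row(line):
    """Return a dict for one sample line, or None for headers and junk."""
    fields = line.split()
    if len(fields) != len(COLUMNS) or not fields[0].isdigit():
        return None
    return dict(zip(COLUMNS, fields))


def summarize(lines):
    """Average a few columns and report the worst read-queue length."""
    rows = [row for row in map(parse_row, lines) if row is not None]
    queued_reads = [int(row["qr|qw"].split("|")[0]) for row in rows]
    return {
        "samples": len(rows),
        "avg_query": mean(int(row["query"]) for row in rows),
        "avg_locked_pct": mean(float(row["locked %"]) for row in rows),
        "max_queued_reads": max(queued_reads),
    }


# Three samples copied from the output above, one per line.
SAMPLE = """\
10 507 25 0 10 59 0 150g 300g 5.86g 54.8 0 5|11 100|37 75k 1m 211 wireclub M 15:57:34
123 993 65 0 20 146 0 150g 300g 5.59g 52.9 0 49|11 101|35 175k 2m 211 wireclub M 15:57:35
0 414 19 0 10 38 0 150g 300g 5.75g 12.7 0 183|37 183|37 46k 674k 234 wireclub M 15:57:37
""".splitlines()

summary = summarize(SAMPLE)
print(summary)
```

Only the first number of `qr|qw` (queued reads) is used here; the same approach would work for the other columns if they turn out to matter.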


On Jul 9, 3:26 pm, Scott Hernandez <scotthernan...@gmail.com> wrote:
> Can you run with --vvvvv so there is more logging output.
>

Scott Hernandez

Jul 9, 2011, 7:22:03 PM
to mongod...@googlegroups.com
Can you create a Jira issue with all the current info and any new
info? http://jira.mongodb.org

On Sat, Jul 9, 2011 at 7:02 PM, Braden <xzinv...@gmail.com> wrote:
>
> We are running with --journal, and have been since the start, it has
> crashed a bunch of times since so I am not sure about the consistency
> of the database anymore.

Have you always been running 1.8.2 or did you run with a journal in
earlier versions?

> We are following another thread now, it looks like we have a rogue
> query doing a HUGE update, since the crash is trying to access the
> local db I wonder if this is oplog related? The local db is 1gb, I
> wonder if we should recreate the oplog to be much larger?

1GB seems like a pretty small oplog. You can run
db.getReplicationInfo() to get an idea of how much time is in your
oplog. I would recommend keeping enough of an oplog for 2-3 times how
long it would take to deploy a new replica in case of a failure.
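
As a rough illustration of that rule of thumb, here is a small Python sketch that scales an oplog size to a target time window. It assumes the write rate stays roughly constant, so the window grows linearly with size. The function name `required_oplog_gb`, the one-hour current window (the figure reported later in this thread) and the four-hour resync time are assumptions for the example, not measurements.

```python
def required_oplog_gb(current_size_gb, current_window_hours,
                      resync_hours, safety_factor=3.0):
    """Estimate the oplog size needed to cover safety_factor * resync_hours.

    Assumes the oplog fills at a constant rate, which is only an
    approximation for a bursty workload.
    """
    if current_window_hours <= 0:
        raise ValueError("current_window_hours must be positive")
    gb_per_hour = current_size_gb / current_window_hours
    target_window_hours = safety_factor * resync_hours
    return gb_per_hour * target_window_hours


# 1 GB oplog covering about one hour; the 4-hour resync time is hypothetical.
estimate = required_oplog_gb(1.0, 1.0, resync_hours=4.0)
print(f"suggested oplog size: {estimate:.1f} GB")
```

The current window would come from db.getReplicationInfo(); the resync time has to be measured or estimated for the actual deployment.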

Braden

Jul 9, 2011, 7:51:26 PM
to mongodb-user
https://jira.mongodb.org/browse/SERVER-3403

This database is only 3 days old, always 1.8.2 and always --journal.

I will resize the oplog. It covers only an hour, not even close to
enough time to resync.

Thanks.

On Jul 9, 4:22 pm, Scott Hernandez <scotthernan...@gmail.com> wrote:
> Can you create a jira issue with all the current info and all new
> info:http://jira.mongodb.org