Why would a NodeJS app give bad data when it is under stress?

lawrence...@gmail.com

unread,

May 3, 2017, 5:55:59 PM5/3/17

to nodejs

No matter how much stress a NodeJS app is subjected to, it should only offer 2 responses:

1.) the correct data

2.) a SocketTimeout exception

A NodeJS app which sends back bad data is useless.

I'm trying to find an answer for the problem described here:

http://www.smashcompany.com/technology/a-surprising-nodejs-failure-mode-deterministic-code-becomes-probabilistic-under-load

Any thoughts on why this might happen when the load is high enough?

Zlatko

unread,

May 4, 2017, 9:53:23 PM5/4/17

to nodejs

Any thoughts on why this might happen when the load is high enough?

Interesting read! I would love to see that code, this way I can only speculate that even though you think your code is all synchronous, it is, in fact, not. Perhaps one of the functions or libraries has an implicit timeout where it then fails silently?

Or maybe you've exhausted the heap but hit some yet undiscovered V8 heap edge case where the it can get mixed up if your contexts get shuffled around a lot. I'm thinking - you allocate a certain amount of memory for those 4 GB of data, but if you're bombing the API, Node might be holding thousands extra sockets open which is taking another chunk of memory that you didn't account for - and than one of your scoring functions can't allocate enough to work?

Did you test this with always hitting the API with the same example? Does it get confused in that case too?

Anyway, I don't know how viable it is, but maybe hooking up Chrome dev tools and setting a breakpoint at those filters and searches could shine some light.

DaneiL

unread,

May 4, 2017, 9:53:23 PM5/4/17

to nod...@googlegroups.com

Hello, i read your entire case, it seems to me, that its not related with node itself, but with hapijs.

Your statement made me reckon a study case made by Netflix.
That you can see here:
https://www.infoq.com/news/2014/12/expressjs-burned-netflix

I think it should give you some insights of what is happening.

That's all

--
Job board: http://jobs.nodejs.org/
New group rules: https://gist.github.com/othiym23/9886289#file-moderation-policy-md
Old group rules: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
---
You received this message because you are subscribed to the Google Groups "nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+unsubscribe@googlegroups.com.
To post to this group, send email to nod...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nodejs/a02a9f1f-ef00-4049-8669-b70a00bf9457%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

[]'s

DaneiL

unread,

May 4, 2017, 9:53:24 PM5/4/17

to nod...@googlegroups.com

Here is a better link: Sorry
https://medium.com/netflix-techblog/node-js-in-flames-ddd073803aa4

--

[]'s

Witold Szczerba

unread,

May 4, 2017, 9:53:38 PM5/4/17

to nod...@googlegroups.com

So you have a bug in your application. That is sad, but the "My app misbehaves, NodeJS is guilty!" grief, 3 pages long won't help here. There is no source code attached.

Did you actually try to figure out what is wrong? Do you have any lead, anything?

lawrence...@gmail.com

unread,

May 6, 2017, 6:32:15 PM5/6/17

to nodejs

I like this idea:

> Perhaps one of the functions or libraries has an implicit timeout where it then fails silently?

but I can't figure out what library that would be. There aren't many libraries in use here, other than HapiJS, and Jaro-Winkler.

This is a very simple app, less than 500 lines of code. When it starts, it loads some data and holds it in memory, using memory as a cache, and when a request comes in, it does some string matching between the input from the outside world and the data from the database.

About this:

> Node might be holding thousands extra sockets open which is taking another

> chunk of memory that you didn't account for - and than one of your scoring

> functions can't allocate enough to work?

I am fairly sure this is happening, but I would expect NodeJS to throw something like a SocketTimeout exception or some other kind of Timeout error. The bizarre thing is that the functions clearly do get enough memory to form a string and convert it to JSON. Again, I am not surprised when I get a SocketTimeout exception, nor am I surprised when I get the correct data, I am only surprised when I get a JSON response that does not have the correct data.

If this was Java then I would know the problem with absolute precision: one thread got a lock on the global variable holding the data, another thread wanted the data but was blocked, the blocked thread eventually timesout its request, and then that thread continues as if the response from the global variable had been null. And in Java I'd have ways of dealing with the situation.

In Javascript, a lot is happening under the hood, but I don't know if it has internal timeouts, or how to control them.

Here is the main function. As you can see, it is pretty much straight procedural code.

function make_relevant_comparisons (document_from_api_call) {

var preliminary_api_call_meta = document_from_api_call['api_call_meta'];

if (!preliminary_api_call_meta) {

preliminary_api_call_meta = {};

}

var api_call_meta = get_api_call_info(preliminary_api_call_meta, document_from_api_call);

var comparison_id = uuidV4();

var all_worthwhile_comparisons = {};

all_worthwhile_comparisons.records = {};

all_worthwhile_comparisons.highest_score = 0;

all_worthwhile_comparisons.context = {};

all_worthwhile_comparisons.context.comparison_id = comparison_id;

all_worthwhile_comparisons.context.api_call_meta = api_call_meta;

all_worthwhile_comparisons.context.errors = check_if_api_client_used_allowed_fields(document_from_api_call, preliminary_api_call_meta);

all_worthwhile_comparisons.context.how_many_documents_were_searched = 0;

all_worthwhile_comparisons.context.how_many_documents_were_returned_after_all_root_endpoint_filters = 0;

all_worthwhile_comparisons.context.which_root_endpoint_filters_passed = [];

delete document_from_api_call['api_call_meta'];

// 2017-03-02 -- api_key_whose_name_should_match_the_name_of_a_database_field is a key we pull from

// the api call and it should exactly match the name of some field in the database. for instance,

// headquarters_addr1 is a field in the database, and can also be a field in the API document

Object.keys(document_from_api_call).forEach(function(api_key_whose_name_should_match_the_name_of_a_database_field) {

var string_from_api_call = document_from_api_call[api_key_whose_name_should_match_the_name_of_a_database_field];

// 2017-03-02 -- This variable:

//

// Data_loaded_at_startup.map_of_sanitized_words_to_uuid_ids

//

// is built up in:

//

// map_of_sanitized_words_to_uuid_ids[name_of_field_in_database][sanitized_word][uuid_id] = uuid_id;

//

var map_of_uuid_ids_which_can_be_used_to_find_values_which_contain_the_target_word = get_map_of_uuid_ids_which_can_be_used_to_find_values_which_contain_the_target_word(api_key_whose_name_should_match_the_name_of_a_database_field, string_from_api_call);

Object.keys(map_of_uuid_ids_which_can_be_used_to_find_values_which_contain_the_target_word).forEach(function(uuid_id) {

// 2017-03-02 -- this variable is built up in load.js like this:

//

// map_of_uuid_ids_and_field_names_to_an_array_of_field_values[uuid_id][name_of_field_in_database_from_key_in_document] = [ document[name_of_field_in_database_from_key_in_document] ];

var document_from_database = Data_loaded_at_startup.map_of_uuid_ids_and_field_names_to_an_array_of_field_values[uuid_id];

var influence_from_page_views_rank = 0;

all_worthwhile_comparisons.context.how_many_documents_were_searched = all_worthwhile_comparisons.context.how_many_documents_were_searched + 1;

var map_of_one_match_on_one_field_from_this_document = compare_two_fields(document_from_api_call, document_from_database, api_key_whose_name_should_match_the_name_of_a_database_field, api_call_meta);

// console.log('false_or_map_of_one_match_on_one_field_from_this_document');

// console.log(false_or_map_of_one_match_on_one_field_from_this_document);

//

// { type_of_match: 'exact_match',

// field: 'profile_name',

// santized_string_from_api_call: 'atlassian corporation plc',

// santized_string_from_database: 'atlassian corporation plc',

// score: 57 }

if (map_of_one_match_on_one_field_from_this_document) {

if (map_of_one_match_on_one_field_from_this_document.type_of_match != 'exact_match'

&& map_of_one_match_on_one_field_from_this_document.type_of_match != 'jaro_winkler') {

console.log( "Error in make_relevant_comparisons ::::::: type_of_match must be either 'exact_match' or 'jaro_winkler'. Instead, it was: " + type_of_match);

throw "Error in make_relevant_comparisons ::::::: type_of_match must be either 'exact_match' or 'jaro_winkler'. Instead, it was: " + type_of_match;

} else {

var field_name = map_of_one_match_on_one_field_from_this_document['field'];

if (!all_worthwhile_comparisons.records[uuid_id]) {

all_worthwhile_comparisons.records[uuid_id] = {};

}

if (!all_worthwhile_comparisons.records[uuid_id][field_name]) {

all_worthwhile_comparisons.records[uuid_id][field_name] = {};

}

if (!all_worthwhile_comparisons.records[uuid_id]['total_score']) {

all_worthwhile_comparisons.records[uuid_id]['total_score'] = 0;

}

// if we search for "Atlassian" we get 10 matches for the field 'profile_name'

//

// "profile_name": [

// "Atlassian Corporation Plc",

// "Atlassian (Australia) Limited",

// "Atlassian (Global) Limited",

// "Atlassian (UK) Limited",

// "Atlassian Australia 1 Pty Ltd.",

// "Atlassian Australia 2 Pty Ltd.",

// "Atlassian Corporation Pty. Ltd.",

// "Atlassian LLC",

// "Atlassian Pty Ltd",

// "Atlassian, Inc",

// "HipChat",

// "MITT Trust",

// "StatusPage.io",

// "Trello, Inc."

// ],

all_worthwhile_comparisons['records'][uuid_id]['total_score'] = all_worthwhile_comparisons['records'][uuid_id]['total_score'] + map_of_one_match_on_one_field_from_this_document.score;

all_worthwhile_comparisons['records'][uuid_id][field_name] = map_of_one_match_on_one_field_from_this_document;

}

// 2017-03-02 -- using page_views_rank to gain a signal is an idea I'm taking from the old PHP code

// but I have not yet tested it for statistical rigor. Dividing by 1000 is my prior. Totally arbitrary.

if (document_from_database['page_views_rank']) {

var array_of_page_views_rank = document_from_database['page_views_rank'];

var page_views_rank = parseInt(array_of_page_views_rank[0], 10);

if (page_views_rank > 1000) {

influence_from_page_views_rank = Math.ceil(page_views_rank / 1000);

}

all_worthwhile_comparisons.records[uuid_id]['total_score'] = all_worthwhile_comparisons.records[uuid_id]['total_score'] + influence_from_page_views_rank;

all_worthwhile_comparisons.records[uuid_id]['influence_from_page_views_rank'] = influence_from_page_views_rank;

all_worthwhile_comparisons.records[uuid_id]['document_from_database'] = document_from_database;

// 2017-03-06 -- its a crime against good code, but to improve performance, I'm removing all the

// extra loops that existed in functions such as is_there_a_match_on_the_headquarters_state_code,

// and I'm doing everything here in this one loop.

var headquarters_country_code = api_call_meta['headquarters_country_code'];

if (headquarters_country_code) {

var headquarters_country_code_from_the_database = all_worthwhile_comparisons['records'][uuid_id]['document_from_database']['headquarters_country_code'];

if (headquarters_country_code_from_the_database == headquarters_country_code) {

// 2017-03-05 -- if a document matches the country, then we should boost its score, to help it

// meet the threshold for best_match and close_match. I'm going to hard code this for now

// as worth 30% of an exact_match.

all_worthwhile_comparisons.context.which_root_endpoint_filters_passed.push('headquarters_country_code');

all_worthwhile_comparisons['records'][uuid_id]['total_score'] = all_worthwhile_comparisons['records'][uuid_id]['total_score'] + api_call_meta['bonus_score_for_passing_a_filter'];

} else {

delete all_worthwhile_comparisons.records[uuid_id];

}

// 2017-03-06 -- its a crime against good code, but to improve performance, I'm removing all the

// extra loops that existed in functions such as is_there_a_match_on_the_headquarters_state_code,

// and I'm doing everything here in this one loop.

var headquarters_state_code = api_call_meta['headquarters_state_code'];

if (headquarters_state_code) {

// 2017-03-06 -- we check to see if maybe we just deleted this

if (all_worthwhile_comparisons['records'][uuid_id]) {

var headquarters_state_code_from_the_database = all_worthwhile_comparisons['records'][uuid_id]['document_from_database']['headquarters_state_code'];

if (headquarters_state_code_from_the_database == headquarters_state_code) {

// 2017-03-05 -- if a document matches the state, then we should boost its score, to help it

// meet the threshold for best_match and close_match. I'm going to hard code this for now

// as worth 30% of an exact_match.

all_worthwhile_comparisons.context.which_root_endpoint_filters_passed.push('headquarters_state_code');

all_worthwhile_comparisons['records'][uuid_id]['total_score'] = all_worthwhile_comparisons['records'][uuid_id]['total_score'] + api_call_meta['bonus_score_for_passing_a_filter'];

} else {

delete all_worthwhile_comparisons.records[uuid_id];

}

// 2017-03-06 -- its a crime against good code, but to improve performance, I'm removing all the

// extra loops that existed in functions such as is_there_a_match_on_the_headquarters_state_code,

// and I'm doing everything here in this one loop.

var type_of_profile = api_call_meta['type_of_profile'];

// if the client sent us a type_of_profile, we have to make sure it is present in any document we send back

if (type_of_profile) {

// 2017-03-06 -- we check to see if maybe we just deleted this

if (all_worthwhile_comparisons['records'][uuid_id]) {

if (!Array.isArray(type_of_profile)) {

all_worthwhile_comparisons.context.errors.push({ "type_of_profile_error" : "type_of_profile must be an array"});

} else {

var documents_type_of_profile = all_worthwhile_comparisons['records'][uuid_id]['document_from_database']['type_of_profile'][0];

// type_of_profile can be "company" or "advisor" or "investor"

if (documents_type_of_profile == type_of_profile[0] ||

documents_type_of_profile == type_of_profile[1] ||

documents_type_of_profile == type_of_profile[2]) {

// 2017-03-05 -- if a document matches, then we should boost its score, to help it

// meet the threshold for best_match and close_match. I'm going to hard code this for now

// as worth 30% of an exact_match.

var type_of_profile_as_string = type_of_profile.toString();

all_worthwhile_comparisons.context.which_root_endpoint_filters_passed.push('type_of_profile: ' + type_of_profile_as_string);

all_worthwhile_comparisons['records'][uuid_id]['total_score'] = all_worthwhile_comparisons['records'][uuid_id]['total_score'] + api_call_meta['bonus_score_for_passing_a_filter'];

} else {

console.log("deleting a record from the return data, because documents_type_of_profile " + documents_type_of_profile + " did not match this api_call type_of_profile: ");

console.log(type_of_profile);

delete all_worthwhile_comparisons.records[uuid_id];

}

if (all_worthwhile_comparisons['records'][uuid_id]) {

if (all_worthwhile_comparisons['records'][uuid_id]['total_score'] > all_worthwhile_comparisons.highest_score) {

all_worthwhile_comparisons.highest_score = all_worthwhile_comparisons['records'][uuid_id]['total_score'];

}

} // if (map_of_one_match_on_one_field_from_this_document) {

}) // Object.keys(map_of_uuid_ids_which_can_be_used_to_find_values_which_contain_the_target_word).forEach(function(uuid_id) {

}); // Object.keys(document_from_api_call).forEach(function(api_key_whose_name_should_match_the_name_of_a_database_field) {

all_worthwhile_comparisons = are_required_fields_present(api_call_meta, all_worthwhile_comparisons);

var how_many_documents_were_returned_after_all_root_endpoint_filters = Object.keys( all_worthwhile_comparisons.records ).length;

all_worthwhile_comparisons.context.how_many_documents_were_returned_after_all_root_endpoint_filters = how_many_documents_were_returned_after_all_root_endpoint_filters;

return all_worthwhile_comparisons;

lawrence...@gmail.com

unread,

May 6, 2017, 6:32:15 PM5/6/17

to nodejs

Danizavtz, that is an interesting read. Thank you for sharing it. But again, because I deliberately bombard the app with more than it can handle, I am expecting it to fail often. And when it fails, I expect to get a SocketTimeout exception, or some other kind of timeout error. So it seems reasonable that the app becomes overwhelmed and then fails -- that is exactly what I am trying to do. But again, I expect only one of two responses:

1.) the app sends the correct data formatted as JSON

2.) the app throws an exception

What doesn't make sense to me is:

3.) the app sends correctly formatted JSON, but the actual data from the memory cache is missing

On Thursday, May 4, 2017 at 9:53:24 PM UTC-4, Danizavtz wrote:

Here is a better link: Sorry
https://medium.com/netflix-techblog/node-js-in-flames-ddd073803aa4

On 4 May 2017 at 09:28, DaneiL <dani...@gmail.com> wrote:

Hello, i read your entire case, it seems to me, that its not related with node itself, but with hapijs.
Your statement made me reckon a study case made by Netflix.
That you can see here:
https://www.infoq.com/news/2014/12/expressjs-burned-netflix

I think it should give you some insights of what is happening.
That's all

On 3 May 2017 at 18:27, <lawrence...@gmail.com> wrote:

No matter how much stress a NodeJS app is subjected to, it should only offer 2 responses:

1.) the correct data

2.) a SocketTimeout exception

A NodeJS app which sends back bad data is useless.

I'm trying to find an answer for the problem described here:

http://www.smashcompany.com/technology/a-surprising-nodejs-failure-mode-deterministic-code-becomes-probabilistic-under-load

Any thoughts on why this might happen when the load is high enough?

--
Job board: http://jobs.nodejs.org/
New group rules: https://gist.github.com/othiym23/9886289#file-moderation-policy-md
Old group rules: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
---
You received this message because you are subscribed to the Google Groups "nodejs" group.

To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+un...@googlegroups.com.

To post to this group, send email to nod...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nodejs/a02a9f1f-ef00-4049-8669-b70a00bf9457%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
[]'s

--
[]'s

Juraj Kirchheim

unread,

May 7, 2017, 10:47:22 PM5/7/17

to nod...@googlegroups.com

You should really try to isolate the error.

1. Run the function from within a simple node script and see if it returns the same results consistently. If not, investigate that.

2. Replace the function with something trivial with a busy loop to get roughly the same runtime duration. If you do get the same non-determinism, then the error is somewhere up the stack. If not, it's gonna be tough, because after step one you can be fairly certain that your function is deterministic, so all you can tell is that it probably causes some side-effect that is relevant to the stack above. Like you might be trashing a lot of memory or something and that causes hapi to behave strangely. In this case I would advice to add back more and more code of the old function until the non-determinism shows again and act on what you have learned from that.

3. Assuming that a busy waiting function shows the same non-determinism, the problem would have to be either in node or in hapi. To test that, your best bet is to serve the functionality without any kind of framework. Because you have only one call, it shouldn't be hard to do. You will find out that either node is the problem or that it isn't. The next step would be to report that to your client and also raise an issue with whoever is responsible for the failure by giving them a minimal example which you should have by this point. Then you should make sure to configure nginx in a way that the concurrency never reaches a level that the node app can't handle and start giving back 503 or something. That way you don't serve incorrect responses.

Best,

Juraj

To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+unsubscribe@googlegroups.com.

To post to this group, send email to nod...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/nodejs/c7353e3f-d936-4808-a133-5eebaf1a9723%40googlegroups.com.

lawrence...@gmail.com

unread,

May 9, 2017, 9:31:19 PM5/9/17

to nodejs

Juraj, thanks. That is a good plan.

Reply all

Reply to author

Forward