Hi there!
The only effect of reducing the fetch size is that retrieving the data takes longer (or at least that's what the counts in the UI show). It doesn't solve the issue; it simply postpones the moment the error first appears. A bunch of errors like these show up in the logs, and the pipeline ends up crashing after a while:
ERROR [Executor task launch worker for task 12:o.a.s.e.Executor@91] Exception in task 12.0 in stage 0.0 (TID 12): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
ERROR [shuffle-server-5-2:o.a.s.n.s.TransportRequestHandler@127] Error opening block StreamChunkId{streamId=2062307135000, chunkIndex=0} for request from /x.x.x.x:58724
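For context, that first error is the well-known Spark 2 GB limit: a single partition or shuffle block cannot exceed Integer.MAX_VALUE bytes, since blocks are addressed with a Java int. So shrinking the fetch size can't fix it; what matters is raising the partition count so each block stays small. A minimal standalone sketch of that idea (paths and the count of 200 are placeholders, not our pipeline code):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RepartitionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("repartition-sketch")
                .getOrCreate();

        // Placeholder input; in the real pipeline this is the Teradata read.
        Dataset<Row> df = spark.read().parquet("/tmp/input");

        // More partitions -> smaller shuffle blocks, each safely under 2 GB.
        // repartition() always shuffles; coalesce() can only merge partitions
        // down to a smaller count but avoids the shuffle.
        df.repartition(200).write().parquet("/tmp/output");

        spark.stop();
    }
}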
There is a repartitioner after the Teradata plugin and a second one right before inserting into BigQuery (both with shuffle = false). The counts increase up to the Wrangler node, but the increase doesn't propagate downstream (or at least it isn't displayed in the UI). The Wrangler only lower-cases the column labels; there is no real transformation of the data.
@Rich, the statistics have been updated, but I can't see any significant performance increase.
@Albert, you said the data is streamed through, but in the Wrangler I see a jump in the counts that doesn't propagate further. The Wrangler recipe only lower-cases the column labels. Do you know where shuffling might be happening, and whether there is a way to deactivate it?
Regards, Daniel
Hello again,
Yesterday I left a pipeline running without the repartitioner (just in case it was adding noise) and set spark.executor.instances = 16, equal to the number of splits in the Teradata plugin. It took about 10 hours to run, but at least it finished “successfully”.
Checking that run, the counts are much higher (238 million) than the total record count of the source table (18.34 million). I wondered whether it could be some kind of issue with the counts in CDAP, but those 238M records were actually inserted into the BigQuery table; please check the attached pics.
Is there any open bug on the Teradata plugin v1.7.0?
I'm not sure whether the issue is related to spark.executor.instances, the size of the source table, or a number of splits greater than 1, but the Teradata source plugin seems to be duplicating records.
I'm going to execute some more runs to see which parameter correlates with the number of duplicates.
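In case anyone wants to double-check on the BigQuery side that this isn't just a UI counting artifact, here is a rough sketch using the google-cloud-bigquery Java client (the project, dataset, table, and the id key column are all placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class DuplicateCheck {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Count how many copies of each key landed in the destination table.
        String sql =
            "SELECT id, COUNT(*) AS copies "
          + "FROM `my_project.my_dataset.my_table` "
          + "GROUP BY id HAVING COUNT(*) > 1 "
          + "ORDER BY copies DESC LIMIT 20";

        TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        result.iterateAll().forEach(row ->
            System.out.println(row.get("id").getStringValue()
                + " -> " + row.get("copies").getLongValue()));
    }
}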
Regards, Daniel
Hi Albert,
Still no clue as to what could be happening...
I was modifying the executor instances and the splits so that they match in number, but the parameter that multiplies the records is actually the number of splits in the Teradata plugin. In my previous email I pointed at spark.executor.instances, but I really meant the number of splits in the Teradata plugin.
Please find different variations of both the import and the bounding queries. Duplication/multiplication of records happens with all of them as soon as the number of splits is greater than 1:
I haven't found in the database-plugins code how the record boundaries are assigned to each split, since that's actually delegated to the MapReduce libraries. And in CDAP 6.7.1, under the "engine config", MapReduce is marked as deprecated. Do you know if that might be causing the issue, such that there is no proper parallelization and each executor is reading and propagating the whole table?
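For reference, this is my mental model of what the MapReduce layer does with the bounding query, heavily simplified and not the actual library code: the min/max returned by the bounding query get cut into N ranges, and each split reads the rows whose split column falls in its range (the boundary predicate is spliced into the import query where $CONDITIONS appears). If the boundaries computed for a string column don't sort the same way as the database collation, ranges can overlap and rows get read more than once. Placeholder values throughout:

public class SplitSketch {
    public static void main(String[] args) {
        // Pretend the bounding query returned these (placeholder values):
        long min = 0, max = 18_340_000;
        int numSplits = 16;
        long step = (max - min) / numSplits;

        // One boundary predicate per split; the last split closes the range.
        for (int i = 0; i < numSplits; i++) {
            long lo = min + i * step;
            boolean last = (i == numSplits - 1);
            long hi = last ? max : lo + step;
            System.out.printf("split %2d: split_col >= %d AND split_col %s %d%n",
                    i, lo, last ? "<=" : "<", hi);
        }
    }
}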
Regards, Daniel
Hi Richard,
Do you remember where you read that?
I can't say the problem is constrained to clustered tables with a string primary key, because the pipeline also generates duplicates with date/datetime-partitioned tables.
Regards, Daniel
Hi guys,
Thanks for that info! The ticket you mentioned seems to be 7 years old. Do you know if the issue is still unresolved?
I think it would be worth mentioning in the Teradata plugin documentation that if the split field is of type string, things can go wild...
As I understand it, if I don't want to give up transfer speed, the only workaround would be to create an additional numeric field that can be used for making the splits, right? Which, on the other hand, won't be that practical for thousands of tables...
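One variation that might avoid adding a physical column to thousands of tables: derive the numeric split key on the fly in the import and bounding queries. HASHROW/HASHBUCKET are Teradata hashing functions that map a value to an integer bucket; I haven't verified this end to end, and the table/key names are placeholders, so treat it as an idea rather than a recipe:

public class DerivedSplitKey {
    public static void main(String[] args) {
        // $CONDITIONS is where the plugin injects each split's boundaries.
        String importQuery =
            "SELECT t.*, HASHBUCKET(HASHROW(t.pk)) AS split_key "
          + "FROM my_db.my_table t "
          + "WHERE $CONDITIONS";

        // Bounding query over the same derived numeric key.
        String boundingQuery =
            "SELECT MIN(HASHBUCKET(HASHROW(t.pk))), "
          + "       MAX(HASHBUCKET(HASHROW(t.pk))) "
          + "FROM my_db.my_table t";

        System.out.println(importQuery);
        System.out.println(boundingQuery);
    }
}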
Regards, Daniel