However, when I try to do a multiget on 1 million row keys it fails with the message "Retried 6 times. Last failure was timeout: timed out", e.g.: colfam.multiget([rowkey1, ..., rowkey_Million])
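For reference, here is a minimal sketch of splitting such a multiget into smaller chunks so that each request stays within the timeout. The keyspace, column family name, key list and chunk size below are placeholders to adapt, not values taken from this thread:

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    colfam = ColumnFamily(pool, 'MyColumnFamily')

    def multiget_in_chunks(cf, keys, chunk_size=100):
        """Yield (rowkey, columns) pairs, issuing one multiget per chunk of keys."""
        for i in range(0, len(keys), chunk_size):
            for rowkey, columns in cf.multiget(keys[i:i + chunk_size]).items():
                yield rowkey, columns

    # Usage, assuming all_rowkeys is your list of one million keys:
    # for rowkey, columns in multiget_in_chunks(colfam, all_rowkeys):
    #     ...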
pycassa.cassandra.ttypes.InvalidRequestException: InvalidRequestException(why="start key's token sorts after end key's token. this is not allowed; you probably should not specify end key at all excep ... (truncated)
The key range begins with start and ends with finish. If left as empty strings, these extend to the beginning and end, respectively. Note that if RandomPartitioner is used, rows are stored in the order of the MD5 hash of their keys, so getting a lexicographical range of keys is not feasible.
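For reference, a short sketch of what does work under RandomPartitioner: iterating the whole column family with an unbounded get_range() instead of asking for a lexicographic key range. colfam is assumed to be set up as in the earlier sketch, and buffer_size just makes the paging explicit:

    # Pages through every row, buffer_size rows per underlying request.
    for rowkey, columns in colfam.get_range(start='', finish='', buffer_size=1024):
        pass  # process the row here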
As I am currently using the RandomPartitioner, could you suggest a suitable approach for my scenario? As you know, I have 1 million row keys. My initial plan was to fetch the data for a range of row keys using get_range(start=rowkey1, finish=rowkeyN), perform some operation on that data, bulk insert the new data into a new Cassandra column family, and repeat for the next range of row keys until all 1 million row keys had been read. Since get_range(start, finish) does not work in my scenario because of the RandomPartitioner, I was wondering whether fetching all 1 million row keys with get_range() at once and then doing the bulk insert into the new Cassandra table would be efficient enough to avoid the timeout below:

<'pycassa.pool.MaximumRetryException'> Retried 6 times. Last failure was timeout: timed out.

In other words, how do I slice a set of row keys so that I can perform my operations and then bulk insert the new data into a new Cassandra column family? I am also worried about larger data sets in the future, with more than a million row keys.
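One possible shape for this, sketched on the assumption that paging through the source CF with get_range() and writing to the destination CF through a mutation batch is acceptable. All names and the transform() step are placeholders; queue_size=5000 simply mirrors the value mentioned in this thread:

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    source_cf = ColumnFamily(pool, 'SourceCF')
    dest_cf = ColumnFamily(pool, 'DestCF')

    def transform(columns):
        # Placeholder for whatever per-row operation you need.
        return columns

    batch = dest_cf.batch(queue_size=5000)          # queue mutations, flushing every 5000
    for rowkey, columns in source_cf.get_range(buffer_size=1024):
        batch.insert(rowkey, transform(columns))
    batch.send()                                    # flush whatever is left in the queue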
The approach I have taken is that these two processes (A & B) create two separate connection pools to the same Cassandra instance for reading CFs A & B. However, the performance seems to be the same even with the multiprocessing approach: with or without it, the program runs for about 2 hours for 1 million data points in each of CF A and CF B. So my doubt is whether I am using the right Python multiprocessing function.
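For comparison, a minimal multiprocessing sketch in which each worker process builds its own ConnectionPool after the fork, since a pool's sockets should not be shared between processes. The keyspace, column family names and the per-row work are placeholders:

    from multiprocessing import Process
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    def read_cf(cf_name):
        # Each worker creates its own pool inside the child process.
        pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
        cf = ColumnFamily(pool, cf_name)
        for rowkey, columns in cf.get_range(buffer_size=1024):
            pass  # process the row here
        pool.dispose()

    if __name__ == '__main__':
        workers = [Process(target=read_cf, args=(name,)) for name in ('A', 'B')]
        for w in workers:
            w.start()
        for w in workers:
            w.join()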
I have done line profiling of my bulk_insert function and found that the batch insert statement accounts for 92% of the total time spent in that function (shaded in yellow below).
Also, the batch insert is NOT happening in chunks of the specified queue_size (say 5000); it only happens after the outer for loop exits, e.g. after 450,000 inserts.
As far as I know, inserting 2 million rows into Cassandra should not take more than 3-4 minutes on a high-end machine with 12 cores.
I have set up a counter, as you can see in my code ("bulkcount = bulkcount + 1"), and that is how I arrived at this number.
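If the mutator really does flush only when the outer loop exits, one thing to try (a sketch, not a diagnosis) is to call send() explicitly every queue_size inserts, so flushing no longer depends on the mutator's internal queue. The keyspace, CF name and get_rows() source are placeholders:

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    dest_cf = ColumnFamily(pool, 'DestCF')

    flush_every = 5000
    batch = dest_cf.batch(queue_size=flush_every)
    bulkcount = 0

    for rowkey, columns in get_rows():              # placeholder row source
        batch.insert(rowkey, columns)
        bulkcount += 1
        if bulkcount % flush_every == 0:
            batch.send()                            # push queued mutations now
    batch.send()                                    # send any remaining partial batch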
I am also worried that the batch insert is taking a lot of time. For instance, processing 2 million rows takes close to 10 minutes rather than a minute or two, and the line profiling shows that the bulk insert accounts for most of that time.