Issue with Batch mode of gocql

harry...@gmail.com

Jan 30, 2014, 12:41:31 AM
to go...@googlegroups.com
Hello,

I wrote my first sample program to load a file of about 7,000 lines into Cassandra using the gocql interface. It uses the NewBatch, batch.Query, and ExecuteBatch calls, much like the unit test cassandra_test.go in the gocql codebase.

The issue is that after a successful load (insert statements) of the whole file, I end up with just 40 rows. That is the result of "select count(*) from foobar" via cqlsh. I understand it could be a mistake in my choice of row key, but I have double-checked everything. Moreover, when I load again with the batch size reduced to 1, I get about 6,500 rows in the resulting table.

The table / column family is declared in cassandra as

create table foobar (
ey      ascii,
u       int,
g       ascii,
et      timestamp,  -- COMMENT 'Event Time',
[.. snip ..  around 30+ columns]

PRIMARY KEY ( (ey, g), et )
);

I am following the guide here on the schema design http://www.opensourceconnections.com/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/

Here is what I have been able to troubleshoot / check.

1. The row key is unique. There are 5,000+ unique combinations of (ey + g), so there is no way we should end up with 40. These are web logs, and (ey + g) identifies a browser. The time component is there mainly to store multiple events from the same browser.

2. I understand Cassandra attaches special meaning to a 'TIMESTAMP' column, but here the timestamp is part of the data (it is the time the event happened). I have also seen references to timestamps in BATCH mode, but I am not sure where to control that.

3. ExecuteBatch does not report any errors.

4. This is a two node cassandra installation. Two hosts, both serving as seeds.

Is there a way to turn on logging to see what goes to cassandra?

Now the golang part of it
----------------------
As I said earlier, I am following the cassandra_test.go example, but of course with minor changes to accommodate the 30+ columns I have. Since not all columns are strings, I add them to a variable declared this way.

    rowdata := make([]interface{}, len(columns))

Then, looping over each column value (from the line being processed), I fill rowdata with string / int / float32 / bool values as appropriate and call batch.Query(query, rowdata...).

Apart from using that variable-arguments golang feature, there is not much there. I am new to golang, so I thought I should mention that.

Any help/hints are appreciated. How do I debug further?

--
Harry

Ben Hood

Jan 30, 2014, 2:06:27 AM
to harry...@gmail.com, go...@googlegroups.com
Hey Harry,

I think it might be easier to help you out if you posted a cut-down
example gist that reproduces the issue. Then we'll have a better idea
of what's going on.

Cheers,

Ben

harry...@gmail.com

Jan 30, 2014, 6:09:33 PM
to go...@googlegroups.com, harry...@gmail.com
Ben,

I have removed unnecessary code, columns and data and made a small sample.

Schema is here  https://gist.github.com/harrysungod/1d29c364da2d05d1271a

Loader code (Go)  https://gist.github.com/harrysungod/fc7849556339316dd07a

Sample data I was using is here  https://github.com/harrysungod/test01/raw/master/httpd-log.201401291000.sample.zip

Or you can just clone this repo https://github.com/harrysungod/test01

I build it this way

go build -gcflags "-N -l" cassloader2.go

and run it this way

./cassloader2 -event_file=httpd-log.201401291000.sample

(You will need to change the IP address of the cassandra server of course )

Although some of the data was edited, I have confirmed that this particular file run with this code still exhibits the same issue.

When I first run it, I end up with just 4 rows in the table.

If I change to a smaller batch size, say 10 rows, I end up with 600+ rows.

cqlsh:logs_ks> select count(*) from sample_logs_cass;

 count
-------
     4

(1 rows)

Thanks
--
Harry

Ben Hood

Jan 31, 2014, 3:33:22 AM
to harry...@gmail.com, go...@googlegroups.com
Harry,

Good news and bad news.

I've run your code and the good news is that I can reproduce your issue.

The bad news is that I can't explain why it is happening.

It seems that the number of rows written to Cassandra is equal to the
number of times that the batch threshold is reached (in your case that
is 4). If you reduce the batch size, you get more rows. It looks like
for some reason only the last batch entry is being executed.

Visually I cannot see what is wrong with your code.

Given that a simple example (e.g.
https://gist.github.com/0x6e6562/8728333) is known to work, I'm
wondering if you can simplify your example a bit further to get rid of
more noise.

I'm sorry that this is not an answer to your question, but I can't
spend much more time on this now.

Cheers,

Ben

stanislav...@gmail.com

Jan 31, 2014, 11:14:17 AM
to go...@googlegroups.com, harry...@gmail.com
You should move "rowdata := make([]interface{}, len(columns))" inside the loop. Otherwise it always points to the same place in memory and inserts only the latest value. Hope this helps.
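
A minimal sketch of why this happens. It uses a hypothetical stand-in type (fakeBatch) instead of gocql itself, and it assumes, as the behavior in this thread suggests, that the batch keeps the argument slice it is given without copying it. The helper names (loadShared, loadFresh, distinct) and the two-column rows are made up for illustration.

```go
package main

import "fmt"

// fakeBatch is a hypothetical stand-in for a batch that stores the
// argument slice it receives rather than copying it (an assumption
// about the real library, not its actual code).
type fakeBatch struct {
	entries [][]interface{}
}

// Query records args as-is. When called as Query(stmt, s...), args
// shares its backing array with s; no new slice is created.
func (b *fakeBatch) Query(stmt string, args ...interface{}) {
	b.entries = append(b.entries, args)
}

// loadShared allocates rowdata once, outside the loop (the bug):
// every entry ends up pointing at the same backing array.
func loadShared(rows [][]string) *fakeBatch {
	b := &fakeBatch{}
	rowdata := make([]interface{}, 2)
	for _, r := range rows {
		rowdata[0], rowdata[1] = r[0], r[1]
		b.Query("INSERT ...", rowdata...)
	}
	return b
}

// loadFresh allocates rowdata inside the loop (the fix):
// every entry gets its own backing array.
func loadFresh(rows [][]string) *fakeBatch {
	b := &fakeBatch{}
	for _, r := range rows {
		rowdata := make([]interface{}, 2)
		rowdata[0], rowdata[1] = r[0], r[1]
		b.Query("INSERT ...", rowdata...)
	}
	return b
}

// distinct counts how many different argument lists the batch holds.
func distinct(b *fakeBatch) int {
	seen := map[string]bool{}
	for _, e := range b.entries {
		seen[fmt.Sprintf("%q", e)] = true
	}
	return len(seen)
}

func main() {
	rows := [][]string{{"a", "1"}, {"b", "2"}, {"c", "3"}}
	fmt.Println(distinct(loadShared(rows))) // 1: all entries hold the last row
	fmt.Println(distinct(loadFresh(rows)))  // 3: one entry per row
}
```

With the shared slice, each batch of N entries contains N copies of the last row, which would match the observation that the table ends up with one row per executed batch.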

harry...@gmail.com

Jan 31, 2014, 1:08:12 PM
to go...@googlegroups.com, harry...@gmail.com, stanislav...@gmail.com

Yes! That was it. In fact, I was trying that after Ben's comment about only the last one going through. I came here to post that and saw your message :)

Somehow I thought the arguments (to a function) themselves are passed by value. Technically I am not passing the array, but instead array elements separated as arguments.
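
On the pass-by-value point: Go does pass arguments by value, but for a slice the value is only a small header that points at a backing array, and a call written as f(s...) hands s to the function unchanged rather than building a new slice from its elements. A small illustration, with hypothetical helper names (setFirst, spreadAliases, explicitCopies):

```go
package main

import "fmt"

// setFirst overwrites the first element of its variadic parameter.
func setFirst(args ...interface{}) {
	args[0] = "changed"
}

// spreadAliases passes the slice with "...": args inside setFirst is
// the same slice, so the caller sees the write.
func spreadAliases() interface{} {
	s := []interface{}{"x", "y"}
	setFirst(s...)
	return s[0]
}

// explicitCopies passes the elements one by one: Go builds a new slice
// for the call, so the caller's slice is untouched.
func explicitCopies() interface{} {
	s := []interface{}{"x", "y"}
	setFirst(s[0], s[1])
	return s[0]
}

func main() {
	fmt.Println(spreadAliases())  // changed
	fmt.Println(explicitCopies()) // x
}
```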

Thanks again for your time folks.
--
Harry

Ben Hood

Jan 31, 2014, 1:37:01 PM
to stanislav...@gmail.com, go...@googlegroups.com, harry...@gmail.com
On Fri, Jan 31, 2014 at 4:14 PM, <stanislav...@gmail.com> wrote:
> You should move "rowdata := make([]interface{}, len(columns))" inside the loop. Otherwise it always points to the same place in memory and inserts only the latest value. Hope this helps.

Well spotted :-)