Schema evolution with HBase datasets

Ben Roling

Jan 27, 2014, 3:13:27 PM
to cdk...@cloudera.org
I was playing around with HBase datasets a little bit this morning and was curious how schema evolution is supposed to work.  I successfully read and wrote some data using the kite-examples HBase example and then tried to write some additional data with one new attribute added to the Avro model.

First, I tried to write the data without doing a DatasetRepository.update() to give the dataset the new schema.  I didn't know if I would be allowed to write data with a newer schema that is still compatible to be read by the dataset's defined schema.

I added favoriteFood in user.avsc:

...
    {
      "name": "age",
      "type": "int",
      "default": 0,
      "mapping": { "type": "column", "value": "meta:age" }
    },
    {
      "name": "favoriteFood",
      "type": ["string", "null"],
      "default": "null",
      "mapping": { "type": "column", "value": "meta:favoriteFood" }
    }
...

When I attempted to put an instance of this new User model, it failed with a NullPointerException:

Exception in thread "main" java.lang.NullPointerException
at org.kitesdk.data.hbase.avro.VersionedAvroEntityMapper.mapFromEntity(VersionedAvroEntityMapper.java:239)
at org.kitesdk.data.hbase.avro.VersionedAvroEntityMapper.mapFromEntity(VersionedAvroEntityMapper.java:58)
at org.kitesdk.data.hbase.impl.HBaseClientTemplate.put(HBaseClientTemplate.java:447)
at org.kitesdk.data.hbase.impl.HBaseClientTemplate.put(HBaseClientTemplate.java:421)
at org.kitesdk.data.hbase.impl.BaseDao.put(BaseDao.java:75)
at org.kitesdk.data.hbase.DaoDataset.put(DaoDataset.java:140)
at org.kitesdk.examples.data.WriteUserDataset.run(WriteUserDataset.java:58)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.kitesdk.examples.data.WriteUserDataset.main(WriteUserDataset.java:70)

I'm not sure whether you would expect writing data that doesn't exactly match the schema of the dataset to be allowed, but even if not it seems an NPE is a bug?

After that I tried to update the schema of the dataset with DatasetRepository.update().  That failed too.  It failed with this exception:

Exception in thread "main" org.kitesdk.data.IncompatibleSchemaException: Column mappings of schema not compatible with other schema for the table. ... (message trimmed for brevity)
at org.kitesdk.data.hbase.manager.DefaultSchemaManager.validateCompatibleWithTableSchemas(DefaultSchemaManager.java:532)
at org.kitesdk.data.hbase.manager.DefaultSchemaManager.migrateSchema(DefaultSchemaManager.java:293)
at org.kitesdk.data.hbase.HBaseMetadataProvider.update(HBaseMetadataProvider.java:130)
at org.kitesdk.data.hbase.HBaseDatasetRepository.update(HBaseDatasetRepository.java:60)
at org.kitesdk.examples.data.WriteUserDataset.run(WriteUserDataset.java:50)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.kitesdk.examples.data.WriteUserDataset.main(WriteUserDataset.java:73)

From a very brief look at the code, it seems that to avoid this exception I would have to create new column mappings for all of the attributes in the schema just to get this one new column in.  I'm thinking maybe this is a bug too?

Overall I'm just curious to know more about how schema evolution is expected to work with the HBase datasets.  An example that covers that would be something great to have in the documentation.
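
As a rough illustration of the general Avro rule behind this question, the sketch below projects a previously written record onto an evolved record schema. It works on plain dicts and is not how Kite's HBase entity mapper is implemented; the `resolve_record` helper, the trimmed-down schema (column mappings omitted), and the sample row are all assumptions made for the example.

```python
import json

# Reader schema: an evolved User model with the new favoriteFood field.
# "null" is listed first so that the JSON null default matches the
# first branch of the union.
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": "int", "default": 0},
    {"name": "favoriteFood", "type": ["null", "string"], "default": null}
  ]
}
""")

_MISSING = object()


def resolve_record(written, reader_schema):
    """Project a written record (a dict) onto the reader's record schema.

    Hypothetical helper mirroring Avro's record resolution rules: fields
    the reader does not declare are ignored, and fields missing from the
    written data take the reader's default. A missing field without a
    default is an error.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        value = written.get(name, _MISSING)
        if value is _MISSING:
            if "default" not in field:
                raise ValueError("no value or default for field %r" % name)
            value = field["default"]
        resolved[name] = value
    return resolved


# A row written before favoriteFood existed still resolves.
old_row = {"username": "bill", "age": 30}
print(resolve_record(old_row, READER_SCHEMA))
```

The same projection drops any field the reader does not declare, which covers the other direction asked about here: data written with the newer model and read with the dataset's older schema.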

Thanks,
Ben

Tom White

Jan 28, 2014, 11:16:31 AM
to Ben Roling, cdk...@cloudera.org
Hi Ben,

Thanks for the email. Comments inline.

On Mon, Jan 27, 2014 at 8:13 PM, Ben Roling <ben.r...@gmail.com> wrote:
> I was playing around with HBase datasets a little bit this morning and was
> curious how schema evolution is supposed to work. I successfully read and
> wrote some data using the kite-examples HBase example and then tried to
> write some additional data with one new attribute added to the Avro model.
>
> First, I tried to write the data without doing a DatasetRepository.update()
> to give the dataset the new schema. I didn't know if I would be allowed to
> write data with a newer schema that is still compatible to be read by the
> dataset's defined schema.
>
> I added favoriteFood in user.avsc:
>
> ...
> {
> "name": "age",
> "type": "int",
> "default": 0,
> "mapping": { "type": "column", "value": "meta:age" }
> },
> {
> "name": "favoriteFood",
> "type": ["string", "null"],
> "default": "null",
> "mapping": { "type": "column", "value": "meta:favoriteFood" }
> }
> ...

It's not causing the problems you are seeing, but you may want to say

"default": null,

so that the default is a null reference, not a string with value "null".

>
> When I attempted to put an instance of this new User model, it failed with a
> NullPointerException:
>
> Exception in thread "main" java.lang.NullPointerException
> at org.kitesdk.data.hbase.avro.VersionedAvroEntityMapper.mapFromEntity(VersionedAvroEntityMapper.java:239)
> at org.kitesdk.data.hbase.avro.VersionedAvroEntityMapper.mapFromEntity(VersionedAvroEntityMapper.java:58)
> at org.kitesdk.data.hbase.impl.HBaseClientTemplate.put(HBaseClientTemplate.java:447)
> at org.kitesdk.data.hbase.impl.HBaseClientTemplate.put(HBaseClientTemplate.java:421)
> at org.kitesdk.data.hbase.impl.BaseDao.put(BaseDao.java:75)
> at org.kitesdk.data.hbase.DaoDataset.put(DaoDataset.java:140)
> at org.kitesdk.examples.data.WriteUserDataset.run(WriteUserDataset.java:58)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.kitesdk.examples.data.WriteUserDataset.main(WriteUserDataset.java:70)
>
> I'm not sure whether you would expect writing data that doesn't exactly
> match the schema of the dataset to be allowed, but even if not it seems an
> NPE is a bug?

This does look like a bug. I've opened
https://issues.cloudera.org/browse/CDK-292

>
> After that I tried to update the schema of the dataset with
> DatasetRepository.update(). That failed too. It failed with this
> exception:
>
> Exception in thread "main" org.kitesdk.data.IncompatibleSchemaException:
> Column mappings of schema not compatible with other schema for the table.
> ... (message trimmed for brevity)
> at org.kitesdk.data.hbase.manager.DefaultSchemaManager.validateCompatibleWithTableSchemas(DefaultSchemaManager.java:532)
> at org.kitesdk.data.hbase.manager.DefaultSchemaManager.migrateSchema(DefaultSchemaManager.java:293)
> at org.kitesdk.data.hbase.HBaseMetadataProvider.update(HBaseMetadataProvider.java:130)
> at org.kitesdk.data.hbase.HBaseDatasetRepository.update(HBaseDatasetRepository.java:60)
> at org.kitesdk.examples.data.WriteUserDataset.run(WriteUserDataset.java:50)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.kitesdk.examples.data.WriteUserDataset.main(WriteUserDataset.java:73)

I just ran a unit test with this schema update and it passed for me.
Can you send the part that was trimmed so we can see what it reported?

Also, I just noticed a mistake in the example where the version of
user.avsc that was checked in had the 'age' field, even though it is
meant to be added manually as a migration. I've checked in a fix on
github. You might like to try the example again from scratch to see if
that part works for you.

>
> Just looking very briefly at the code it seems to avoid this exception I
> would have to create new column mappings for all of the attributes in the
> schema just to get this one new column in? I'm thinking maybe this is a bug
> too?
>
> Overall I'm just curious to know more about how schema evolution is expected
> to work with the HBase datasets. An example that covers that would be
> something great to have in the documentation.

I agree. The example we have at the moment is just a start, and we'd
like to add more. This is tracked by
https://issues.cloudera.org/browse/CDK-34 and
https://issues.cloudera.org/browse/CDK-35. In the meantime you can
look at some of the schema migration tests in
https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-hbase/src/test/java/org/kitesdk/data/hbase/avro/ManagedDaoTest.java.

Cheers,
Tom

>
> Thanks,
> Ben

Doug Cutting

Jan 28, 2014, 12:02:19 PM
to Tom White, Ben Roling, cdk...@cloudera.org
On Tue, Jan 28, 2014 at 8:16 AM, Tom White <t...@cloudera.com> wrote:
>> "type": ["string", "null"],
>> "default": "null",
>
> It's not causing the problems you are seeing, but you may want to say
>
> "default": null,
>
> so that the default is a null reference, not a string with value "null".

Right: type names (like "null") are strings while default values are
JSON data structures (like null).

Also, the "null" should be first in the union, since a default value
is of the type in the first branch of the union.

Thus the idiom for a field whose value defaults to null is:

... "type":["null", <X>], "default":null ...

Right now the type of the default value is only checked at runtime, so
lots of folks make this error. It would be better to check it at
schema parse time, but expensive if default values are complex. (The
current implementation of default value parsing involves serializing
and deserializing the value with the appropriate DatumReader.)
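
A quick way to see the three spellings side by side is to parse each one as plain JSON. This is only an illustration using Python's `json` module, not Avro's own parser; the `describe` helper and the variant labels are made up for the example.

```python
import json

# Three ways to declare the field's type and default.
variants = {
    "as posted": '{"type": ["string", "null"], "default": "null"}',
    "null default, string first": '{"type": ["string", "null"], "default": null}',
    "idiom": '{"type": ["null", "string"], "default": null}',
}


def describe(field_json):
    """Return (first union branch, Python type name of the parsed default)."""
    field = json.loads(field_json)
    return field["type"][0], type(field["default"]).__name__


for label, text in variants.items():
    print(label, describe(text))
# as posted ('string', 'str')
# null default, string first ('string', 'NoneType')
# idiom ('null', 'NoneType')
```

In the first variant the default agrees with the first branch, but it is the four-character string "null" rather than an absent value. In the second the default is a JSON null while the first branch is "string", which is the mismatch described above. Only the third both type-checks and means "no value".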

Fixing this to throw exceptions at schema parse time would increase
the usability of Avro, but would also probably break a lot of
applications that have bad defaults but have never relied on them.

Doug

Ben Roling

Jan 28, 2014, 12:08:00 PM
to Tom White, cdk...@cloudera.org
I mistakenly dropped the group from this thread by accidentally replying only directly to Tom.  This reply will bring the rest of the thread back into the group.


On Tue, Jan 28, 2014 at 10:56 AM, Ben Roling <ben.r...@gmail.com> wrote:
Sure, I will open a JIRA.  Note - the original example itself used "users" as the dataset name and "User" as the entity so the problem existed without my modification of the dataset name.  I only changed the dataset name from "users" to "kite_users" to avoid potential conflict with another "users" HBase table in the shared environment where I was executing my test.  I probably should have just run the test in a VM and kept everything as close to the original example as possible to avoid confusion in this discussion.


On Tue, Jan 28, 2014 at 10:49 AM, Tom White <t...@cloudera.com> wrote:
On Tue, Jan 28, 2014 at 4:37 PM, Ben Roling <ben.r...@gmail.com> wrote:
> Thanks Tom.  Indeed, "default": "null" was a mistake, but inconsequential to
> this discussion.  Thanks for pointing it out though.
>
> The rest of the trimmed message is below. Note - I changed the name of the
> "users" dataset to "kite_users" in my run, but I expect that should make no
> difference.  I debugged into the code a little bit and I think the problem
> may be at this statement: "if
> (!managedSchema.getName().equals(entitySchema.getName())) {".  The names
> being compared are not the same thing.  managedSchema.getName() is
> "kite_users" and entitySchema.getName() is "User".

I agree this looks like a bug. Changing the dataset name shouldn't
have any effect, however there are some complications due to
supporting multiple datasets in a single table which tie the two
concepts together in the current implementation (see
HBaseMetadataProvider#getEntityName). For the moment you should
probably use the same name for the entity and the dataset name.

Would you like to open a JIRA for this bug?

Thanks,
Tom

>
> Exception in thread "main" org.kitesdk.data.IncompatibleSchemaException:
> Column mappings of schema not compatible with other schema for the table.
> Table: kite_users. Other schema: {
>   "type" : "record",
>   "name" : "User",
>   "namespace" : "org.kitesdk.examples.data",
>   "doc" : "A user record",
>   "fields" : [ {
>     "name" : "username",
>     "type" : "string",
>     "mapping" : {
>       "type" : "key",
>       "value" : "0"
>     }
>   }, {
>     "name" : "creationDate",
>     "type" : "long",
>     "mapping" : {
>       "type" : "column",
>       "value" : "meta:creationDate"
>     }
>   }, {
>     "name" : "favoriteColor",
>     "type" : "string",
>     "mapping" : {
>       "type" : "column",
>       "value" : "meta:favoriteColor"
>     }
>   }, {
>     "name" : "age",
>     "type" : "int",
>     "default" : 0,
>     "mapping" : {
>       "type" : "column",
>       "value" : "meta:age"
>     }
>   } ]
> } New schema:
> {"type":"record","name":"User","namespace":"org.kitesdk.examples.data","doc":"A user record","fields":[{"name":"username","type":"string","mapping":{"type":"key","value":"0"}},{"name":"creationDate","type":"long","mapping":{"type":"column","value":"meta:creationDate"}},{"name":"favoriteColor","type":"string","mapping":{"type":"column","value":"meta:favoriteColor"}},{"name":"age","type":"int","default":0,"mapping":{"type":"column","value":"meta:age"}},{"name":"favoriteFood","type":["string","null"],"default":"null","mapping":{"type":"column","value":"meta:favoriteFood"}}]}
> at org.kitesdk.data.hbase.manager.DefaultSchemaManager.validateCompatibleWithTableSchemas(DefaultSchemaManager.java:532)

Ben Roling

Jan 28, 2014, 12:17:22 PM
to cdk...@cloudera.org, Tom White
JIRA logged with regard to expectation of dataset and entity names matching:



Tom White

Jan 28, 2014, 12:22:58 PM
to Doug Cutting, Ben Roling, cdk...@cloudera.org
I think it would be good to have Avro check for this case at schema
parse time. We might add an option to make the parser fail, defaulting
to off for compatibility. The tools could emit warnings when they
detect the problem.

Tom


Doug Cutting

Jan 28, 2014, 12:38:47 PM
to Tom White, Ben Roling, cdk...@cloudera.org
On Tue, Jan 28, 2014 at 9:22 AM, Tom White <t...@cloudera.com> wrote:
> I think it would be good to have Avro check for this case at schema
> parse time. We might add an option to make the parser fail, defaulting
> to off for compatibility. The tools could emit warnings when they
> detect the problem.

+1

It can be implemented reasonably efficiently without much work. We
need a validateDefault(Schema, JsonNode) method. Primitives are easy.
Unions recurse on their first branch. Records recurse on each field
with the correspondingly named value in the default, or, when that's
missing, the default from the field itself, if any. This doesn't have
to worry about how default values will be represented as data, only
whether their JSON representation is compatible with the schema.
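
The outline above can be sketched roughly as follows. This is a Python approximation over schemas parsed with `json.loads`, not Avro's Java API and not the change that was eventually made; the `validate_default` signature, the `PRIMITIVE_CHECKS` table, and the `Food` record are assumptions for illustration, and arrays, maps, enums, fixed, and named type references are deliberately left out.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# JSON encodings of defaults for Avro primitive types.
PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": _is_int,
    "long": _is_int,
    "float": _is_number,
    "double": _is_number,
    "string": lambda v: isinstance(v, str),
    "bytes": lambda v: isinstance(v, str),
}


def validate_default(schema, default):
    """Return True if the JSON value `default` is compatible with `schema`.

    `schema` is a JSON-parsed Avro schema: a type name (str), a union
    (list), or a complex type (dict).
    """
    if isinstance(schema, list):
        # Unions recurse on their first branch.
        return bool(schema) and validate_default(schema[0], default)
    if isinstance(schema, str):
        check = PRIMITIVE_CHECKS.get(schema)
        if check is None:
            raise ValueError("type not handled by this sketch: %r" % schema)
        return check(default)
    kind = schema["type"]
    if kind == "record":
        if not isinstance(default, dict):
            return False
        for field in schema["fields"]:
            if field["name"] in default:
                value = default[field["name"]]
            elif "default" in field:
                # Fall back to the default declared on the field itself.
                value = field["default"]
            else:
                return False
            if not validate_default(field["type"], value):
                return False
        return True
    # Forms such as {"type": "string"} wrap another schema.
    return validate_default(kind, default)


print(validate_default(["string", "null"], "null"))  # True
print(validate_default(["string", "null"], None))    # False
print(validate_default(["null", "string"], None))    # True

# Hypothetical nested record, to exercise the per-field recursion.
food = {"type": "record", "name": "Food", "fields": [
    {"name": "name", "type": "string"},
    {"name": "spicy", "type": "boolean", "default": False},
]}
print(validate_default(food, {"name": "pizza"}))     # True
print(validate_default(food, {}))                    # False
```

The check looks only at the JSON shape of the default, never at how the value would be represented as data, which is what keeps it cheap enough to run at schema parse time.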

Will you file the Jira or should I?

Doug

Tom White

Jan 29, 2014, 10:16:56 AM
to Doug Cutting, Ben Roling, cdk...@cloudera.org
The implementation sketch sounds good to me. I filed
https://issues.apache.org/jira/browse/AVRO-1449 for this.

Thanks,
Tom