Schema evolution in HIVE dataset

andrew stevenson

Mar 19, 2015, 2:10:08 PM
to cdk...@cloudera.org

Hi Guys,

I’m trying to update the schema of an external Hive dataset, but I’m having problems with schema evolution. I use Sqoop to generate the Avro schema and create the dataset from it initially. Then I alter the source database table and extract the schema again. Finally, I want to merge the schemas and update the dataset in the repo with the new schema, but I get an incompatible schema error.

The inbound data will come from Sqoop, but I can't write it directly as Parquet because I need Sqoop's direct mode to get the best extract performance.

def merge(source_schema: Schema) = {
  val target_descriptor = dataset.getDescriptor
  val target_schema = target_descriptor.getSchema
  log.info(source_schema.toString(true))

  if (!repo.exists(database, name)) log.error("Dataset %s not found".format(dataset_path))
  else {
    if (source_schema == target_schema) {
      log.info("No change in schemas detected.")
    } else {
      val updated_descriptor: DatasetDescriptor = new DatasetDescriptor.Builder(target_descriptor)
        .schema(source_schema)
        .build()
      //Datasets.update(dataset.getUri, updated_descriptor)
      repo.update(database, name, updated_descriptor)
    }
  }
}



Exception in thread "main" org.kitesdk.data.IncompatibleSchemaException: Schema cannot read data written using existing schema. Schema: {
  "type" : "record",
  "name" : "sqoop_import_categories",
  "doc" : "Sqoop import of categories",
  "fields" : [ {
    "name" : "category_id",
    "type" : [ "int", "null" ],
    "columnName" : "category_id",
    "sqlType" : "4"
  }, {
    "name" : "category_department_id",
    "type" : [ "int", "null" ],
    "columnName" : "category_department_id",
    "sqlType" : "4"
  }, {
    "name" : "category_name",
    "type" : [ "string", "null" ],
    "columnName" : "category_name",
    "sqlType" : "12"
  }, {
    "name" : "my_test_col",
    "type" : [ "int", "null" ],
    "columnName" : "my_test_col",
    "sqlType" : "4"
  }, {
    "name" : "my_test_col2",
    "type" : [ "int", "null" ],
    "columnName" : "my_test_col2",
    "sqlType" : "4"
  }, {
    "name" : "my_test_col3",
    "type" : [ "int", "null" ],
    "columnName" : "my_test_col3",
    "sqlType" : "4"
  } ],
  "tableName" : "categories"
}


Existing schema: {
  "type" : "record",
  "name" : "sqoop_import_categories",
  "doc" : "Sqoop import of categories",
  "fields" : [ {
    "name" : "category_id",
    "type" : [ "int", "null" ],
    "columnName" : "category_id",
    "sqlType" : "4"
  }, {
    "name" : "category_department_id",
    "type" : [ "int", "null" ],
    "columnName" : "category_department_id",
    "sqlType" : "4"
  }, {
    "name" : "category_name",
    "type" : [ "string", "null" ],
    "columnName" : "category_name",
    "sqlType" : "12"
  }, {
    "name" : "my_test_col",
    "type" : [ "int", "null" ],
    "columnName" : "my_test_col",
    "sqlType" : "4"
  }, {
    "name" : "my_test_col2",
    "type" : [ "int", "null" ],
    "columnName" : "my_test_col2",
    "sqlType" : "4"
  } ],
  "tableName" : "categories"
}



Thanks


Andrew

Joey Echeverria

Mar 19, 2015, 11:11:08 PM
to andrew stevenson, cdk...@cloudera.org
Hi Andrew!

The problem is that you didn't set a default value for your new field,
my_test_col3. That makes the schema incompatible: when reading old data
files written without that field, Avro won't know what value to give
it. This, along with the full set of schema evolution rules we follow,
is described on our site:

http://kitesdk.org/docs/1.0.0/Schema-Evolution.html
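
As a rough sketch of what the new field might look like with a default (assuming a null default is acceptable for your data; Avro's specification has traditionally required a union field's default to match the first branch of the union, so "null" is moved to the front here):

```json
{
  "name" : "my_test_col3",
  "type" : [ "null", "int" ],
  "default" : null,
  "columnName" : "my_test_col3",
  "sqlType" : "4"
}
```

If you would rather keep the [ "int", "null" ] ordering that Sqoop generated, an int default such as 0 should serve the same purpose.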

-Joey

--
Joey Echeverria
Senior Infrastructure Engineer