RhinoEtl: diagnostic information about INSERT executed in Load operations

28 views

Skip to first unread message

Stefano

unread,

Nov 6, 2011, 8:27:24 AM11/6/11

to rhino-t...@googlegroups.com

Hi all,

I'm writing an ETL process that should execute in the Load operation some INSERTS towards relational databases (that could be Sql Server, DB2, Oracle, or what else). I simply use and OutputCommandOperation for this.

The INSERT statements are quite simple and "standard" in their syntax, but could affect a lot of columns in a single table (in the order of a number of columns beetween 10 and 100).

Sometimes, during the development and testing of these systems, can happen that some value is not correctly managed by the destination database, e.g. a string value is longer than the supported length of the columns, a data type doesn't correspond, etc...

In these situations, the errors raised from the database could be quite "generic", e.g. the infamous "String or binary data would be truncated" of Sql Server that doesn't signal which is the column that causes the problem.

In some situations we could use a profiler, such as the one of Sql Server, and see which is the offending INSERT statement, and from there we can try to detect the offending column, but this is not always an applicable and efficient strategy.

In situations like this one, does RhinoEtl provide some kind of support for diagnostic, or can you suggest some steps to detect in an easier way the problem?

Jason Meckley

unread,

Nov 7, 2011, 1:10:09 PM11/7/11

to rhino-t...@googlegroups.com

validate the data before inserting the data. then filter out the invalid data before inserting. dump the invalid rows somewhere else so you can review, alter and process those specific rows again.

here is a simple example of validating the length of the string
foreach(var row in rows)
{
   var length = row[key].ToString().Length;
   if(length > max)
   {
        row["error"] = string.Format("The value is {0} characters, but the column only allows {1}", length, max)
   }
   yield return row;
}

you could expand on the validation make it more generic. something like
Func<Row, bool> Guard Against {get;set;}
string Message {get;set;}

foreach(var row in rows)
{
   Validate(row);
   yield return row;
}

private void Validate(Row row)
{
   if(GuardAgainst(row) == false) return;
   row["error"] = Message;
}

then in another operation
foreach(var row in rows)
{
   var haserrors = row.Contains("error");
   if(haserrors)
   {
        set row aside to fix and process later
        continue; //prevents the current row from continuing down the pipeline
   }
   yield return row; //valid rows only
}

your process would then look like this
Initialize()
{
   Regsiter(new InputCommand(...));
   Register(new ValidateRow {
                                               GuardAgainst = r => { r["key"].ToString().Length},
                                               Message = "The value of 'key' is too long"
                                          });
   Register(new ValidateRow {
                                               GuardAgainst = r => { r["other"] != null},
                                               Message = "The value of 'other' is null"
                                          });
   Register(new FilterOutRowsWithErrors());
   Register(new ConventionInputCommand(...));
}

Reply all

Reply to author

Forward

0 new messages