Rhino ETL multiple result sets

187 views
Skip to first unread message

Remco Ros

unread,
Jul 20, 2012, 10:34:49 AM7/20/12
to rhino-t...@googlegroups.com
Hi,

I want to use Rhino ETL do normalize relational data to documents.

We need to fetch all products based on some heavy query.
then fetch all related data about these products (around 6 tables).

Because of performance I figured I don't want to join all these tables to products, because of the duplicate data.

So we use something like this now:

(all in one query):

---
CREATE TABLE #productids
(
productId uniqueidentifier
)

INSERT INTO #productids
SELECT p.ProductId
FROM
tblProduct p
// snip... some heavy query to determine all products we want.

SELECT * FROM tblProduct WHERE productId IN (SELECT productId FROM @productids)
SELECT * FROM tblProductCategories WHERE productId IN (SELECT productId FROM @productids) 
SELECT * FROM tblProductMaterialls WHERE productId IN (SELECT productId FROM @productids) 
SELECT * FROM tblProductTypes WHERE productId IN (SELECT productId FROM @productids) 
// etc....
---

currently I read all this data using SqlDataReader into a 'Product' class.
then convert this list of products to a dictionary (for fast access)
then rdr.NextResult()
then loop trough the datareader and fill the extra product data
then rdr.NextResult()
then loop trough the datareader and fill the extra product data
etc.


How can this process be expressed as an ETL process? Because, As far as I studied Rhino ETL. It works only of one set of rows, and has no support for multiple result sets.

Any ideas?

Nathan Palmer

unread,
Jul 22, 2012, 10:35:44 AM7/22/12
to rhino-t...@googlegroups.com
Remco,

You can do multiple result-sets by using the join operators. Here is an example using the DSL.


However if you need multiple streams into the same result-set it can still be done but with a custom operation. You'd want to selected from the first table and emit all of the rows.. then select from the second and emit all of those. At the end of the operation you'd get the combination of all of the tables. This is of course assuming all tables have the same columns.

Let me know if this makes sense or not. I might have an example of this somewhere if I dig through some code.

Nathan Palmer

--
You received this message because you are subscribed to the Google Groups "Rhino Tools Dev" group.
To view this discussion on the web visit https://groups.google.com/d/msg/rhino-tools-dev/-/_tQtqa4iCvkJ.
To post to this group, send email to rhino-t...@googlegroups.com.
To unsubscribe from this group, send email to rhino-tools-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rhino-tools-dev?hl=en.

Remco Ros

unread,
Jul 22, 2012, 2:47:32 PM7/22/12
to rhino-t...@googlegroups.com, em...@nathanpalmer.com
The tables don't have the same schema/data.
I need to denormalize product data from sql server to solr documents (with multivalued fields).

the source schema is somewhat like i posted before (product / categories / articlegroup / product attributes).
But we don't need all data, we need just the products matching some sql criteria.
We found this query is somewhat costly, so we optimized it.

All in one sql command: 
- We first determine all product id's we want and insert them into a temp table.
- we select al product data (one row for each product)
- then select categories from products which match the temp table.
- then select other attributes for products in the temp table
- etc.

on the client side, I fetch the first results (products) and construct a Product object from it.
i put those in a dictionary<productid, product> for quick access later.
Then I fetch the next resultset.
I loop trough this set once, and add that data to the correct product using the previous dictionary: productsById[productid].Categories.Add(rdr["CatId"], etc.)

In the end, this leaves me with a large collection of denormalized product data.

I do all this in one AbstractOperation currently, passing all the products as a result for further processing.

I was hoping I could split this one big operation into several, so we can for example add a new operation for each new product attribute to extract/transform/load.

I could express this as one big SQL query and figure the joining out on the client side, but this would lead to one big cartesian product.

Op zondag 22 juli 2012 16:35:44 UTC+2 schreef Nathan Palmer het volgende:
Nathan Palmer

To post to this group, send email to rhino-tools-dev@googlegroups.com.
To unsubscribe from this group, send email to rhino-tools-dev+unsubscribe@googlegroups.com.

Nathan Palmer

unread,
Jul 22, 2012, 5:32:42 PM7/22/12
to rhino-t...@googlegroups.com
Seems like you could split it into separate operations.

Operation 1
  - Select ProductID from tblProducts

That first operation determines the product ids for the rest of the operations.

Operation 2
  - Branching. Use the BranchingOperation or MultiThreadedBranchingOperation to split the list of product ids into multiple streams

Operation 3, 4 and 5
  - Separate operations for categories, materials and types that you wire up to operation 2. Each operation will take the original result set and do what you need with it. You can run a join for each to pull the additional information needed.

Nathan

To view this discussion on the web visit https://groups.google.com/d/msg/rhino-tools-dev/-/CVw-zFwzUQ4J.

To post to this group, send email to rhino-t...@googlegroups.com.
To unsubscribe from this group, send email to rhino-tools-d...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages