merging two tables, when the rows of table1 are a subset of the rows of table2


Iris Eiron (Itai)

May 24, 2010, 10:11:24 AM
to Jaql Users
jaql> $y = [ {userid: 'iris',  attr: 'a', value: 2},
             {userid: 'dafna', attr: 'b', value: 3} ];

We would still want to add the attr/value pairs of $taxonomy to each of
the userids of $y, and get:

[
  { "userid": "iris",  "attr": "a", "value": 3 },
  { "userid": "iris",  "attr": "b", "value": 1 },
  { "userid": "iris",  "attr": "c", "value": 1 },
  { "userid": "dafna", "attr": "b", "value": 4 },
  { "userid": "dafna", "attr": "c", "value": 1 },
  { "userid": "dafna", "attr": "a", "value": 1 }
]

I was thinking of replicating each row of $taxonomy once per user:

jaql> $users = group $y by userid = $.userid into {userid};
jaql> $taxonomy1 = join $taxonomy, $users into {$taxonomy.*, $users.*}

but I can't do that, since join has to be fully connected.
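
Spelled out, the replication I have in mind would look something like
this (just a sketch, writing the cross product with expand instead of
the join):

jaql> $taxonomy1 = $users
      -> expand each u ( $taxonomy -> transform { u.userid, $.* } );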

Also, it does not seem very efficient, since $y may contain many
additional fields, each with its own set of values, and the size of
$taxonomy1 will be much bigger than the original size of
$taxonomy ...

Thanks,
- Iris


Kevin Beyer

May 25, 2010, 4:42:00 AM
to Jaql Users
Iris --

I'm not sure what you're asking. What does taxonomy look like? What
do you mean by "add" to it?

I'll answer a similar-sounding question: say I have a record that
represents default values for a bunch of attributes, and for each user
I have a record which overrides or adds attributes to that record.
How can I merge the two records? The 'remap' function does the job.
It takes two records: the first has the default field values, and the
second has the new/replacement fields.

defaults = { a: 3, b: 1, c: 1 };

users = [
  { userid: 'iris',  data: { a: 2, c: 4 }},
  { userid: 'dafna', data: { b: 3 }},
];

users
-> transform { $.userid, data: remap( defaults, $.data ) }
;

result:

[
{ "userid": "iris", "data": { "a": 2, "b": 1, "c": 4 }},
{ "userid": "dafna", "data": { "a": 3, "b": 3, "c": 1 }}
]
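
Equivalently, calling remap directly on a pair of records:

jaql> remap( { a: 3, b: 1, c: 1 }, { a: 2, c: 4 } );
{ "a": 2, "b": 1, "c": 4 }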


In your case, it appears you have the values as array elements, rather
than a record. The following works if the default table and the
number of attributes per user are relatively small. In particular,
the default table might be passed around to the map tasks
inefficiently.

defaults = [
  { attr: "a", value: 3 },
  { attr: "b", value: 1 },
  { attr: "c", value: 1 }
];

overrides = [
  { userid: 'iris',  attr: 'a', value: 2 },
  { userid: 'dafna', attr: 'b', value: 3 },
  { userid: 'iris',  attr: 'c', value: 4 }
];

// cogroup on attr: D holds the matching default record (if any) and
// O the matching override; prefer the override, fall back to the default
updateTable = fn( defaults, overrides )
  group defaults  by $.attr as D,
        overrides by $.attr as O
  into firstNonNull( singleton(O), singleton(D) )
;

updateTable( defaults, overrides -> filter $.userid == 'iris' );

[
  { "attr": "b", "value": 1 },
  { "attr": "c", "userid": "iris", "value": 4 },
  { "attr": "a", "userid": "iris", "value": 2 }
]
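
And for the other user, where only "b" is overridden (output order may
vary):

updateTable( defaults, overrides -> filter $.userid == 'dafna' );

[
  { "attr": "a", "value": 3 },
  { "attr": "b", "userid": "dafna", "value": 3 },
  { "attr": "c", "value": 1 }
]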


overrides
-> group by userid = $.userid as O expand (
     updateTable( defaults, O )
     // ${*-.userid} copies every field except userid, which is
     // re-added up front from the group key
     -> transform { userid, ${*-.userid} }
   )
;


[
  { "userid": "iris",  "attr": "b", "value": 1 },
  { "userid": "iris",  "attr": "c", "value": 4 },
  { "userid": "iris",  "attr": "a", "value": 2 },
  { "userid": "dafna", "attr": "b", "value": 3 },
  { "userid": "dafna", "attr": "c", "value": 1 },
  { "userid": "dafna", "attr": "a", "value": 3 }
]

Iris Eiron

Jun 21, 2010, 8:59:03 AM
to Jaql Users
Hi Kevin,

I tried to follow your example, and defined the following function:

$applyDefaults = fn(
    [{attribute: string, value: string, TF: long}...] $values,
    [{attribute: string, value: string, TF: long, P: double}...] $defaults ) (
  join $values, preserve $defaults
  where $values.attribute == $defaults.attribute
    and $values.value == $defaults.value
  into { $defaults.attribute, $defaults.value,
         P: $defaults.P,
         TF: firstNonNull($values.TF, $defaults.TF) }
  // Bernoulli scoring: use P when the attribute/value is present (TF > 0),
  // 1-P when it is absent, then sum over all attribute/value pairs
  -> transform { $.attribute, $.value,
                 P: (if ($.TF > 0) ($.P) else (1 - $.P)) }
  -> group into sum($[*].P)
  -> singleton()
);

I tried this function on an example and it ran successfully. Here are my
sample data and the (expected) results:

$userClassAttrValueProb = [
  { "P": 0.1, "attribute": "genre", "class": 1, "userid": 673, "value": "thriller" },
  { "P": 0.2, "attribute": "genre", "class": 1, "userid": 673, "value": "sport" }
];

$docAttrValueTF = [
  { "docid": 1, "attribute": "genre", "value": "thriller", "TF": 1 },
  { "docid": 2, "attribute": "genre", "value": "sport",    "TF": 1 },
  { "docid": 3 },
  { "docid": 4, "attribute": "genre", "value": "thriller", "TF": 1 },
  { "docid": 4, "attribute": "genre", "value": "sport",    "TF": 1 }
];

jaql> $userClassAttrValueDefaultProb = $userClassAttrValueProb
      -> group by keys = {$.userid, $.class} as values
         into { keys.*,
                defaults: (values -> transform {$.attribute, $.value, TF: 0, P: $.P}) };

jaql> $docAttrValueTFSet = $docAttrValueTF
      -> group by keys = {$.docid} as values
         into { keys.*,
                values: (values -> transform {$.attribute, $.value, $.TF}) };

jaql> for( u in $userClassAttrValueDefaultProb, d in $docAttrValueTFSet )
      [{ u.userid, u.class, d.docid, P: $applyDefaults(d.values, u.defaults) }];
[
  { "userid": 673, "class": 1, "docid": 1, "P": 0.9 },
  { "userid": 673, "class": 1, "docid": 4, "P": 0.30000000000000004 },
  { "userid": 673, "class": 1, "docid": 2, "P": 1.1 },
  { "userid": 673, "class": 1, "docid": 3, "P": 1.7000000000000002 }
]
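
(Checking the numbers: for docid 1, "thriller" matches with TF = 1 and
contributes P = 0.1, while "sport" is absent and contributes
1 - 0.2 = 0.8, so the sum is 0.9. For docid 3, which has no
attribute/value pairs at all, both defaults flip, giving
(1 - 0.1) + (1 - 0.2) = 1.7.)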


HOWEVER, when I tried to run it on my real data, which is much bigger,
Jaql threw an exception. I was able to calculate
$userClassAttrValueDefaultProb and $docAttrValueTFSet and to write
them into HDFS. The sequence I was trying to perform was therefore:


jaql> $userClassAttrValueDefaultProb =
        read(hdfs(strcat("hdfs/nb/1c/userClassAttrValueDefaultProb")));

jaql> $docAttrValueTFSet =
        read(hdfs(strcat("hdfs/nb/1c/docAttrValueTFSet")));

jaql> $scores = for( u in $userClassAttrValueDefaultProb, d in $docAttrValueTFSet )
        [{ u.userid, u.class, d.docid, P: $applyDefaults(d.values, u.defaults) }];

jaql> $scores -> write(hdfs(strcat("hdfs/nb/1c/scores-sum.bernoulli")));

That causes a java.lang.RuntimeException: Error in configuring object.

Any idea what is causing this failure?

Thanks,
- Iris

Iris Eiron

Jun 21, 2010, 4:31:17 PM
to Jaql Users
I added the output of explain:

jaql> explain $scores -> write(hdfs("hdfs/nb/1c/genre.colorInfo.certificate.keyword/scores-sum.bernoulli"));
(
$signature = ( system::const([
"genre",
"colorInfo",
"certificate",
"keyword"
])
-> system::combine(fn(a, b) (system::strcat(a, ".", b))) ),
$fd_0 = system::mapReduce({ ("input"):
(system::hdfs(system::strcat("hdfs/nb/1c/", $signature, "/
docAttrValueTF"))), ("output"):(system::HadoopTemp(schema=schema {
"docid"?: any,
"values": [
{
"attribute"?: any,
"value"?: any,
"TF"?: any
} ...
]
})), ("map"):(fn(schema [
any ...
] $mapIn0) ($mapIn0
-> transform each $i0 ([{ ($i0).("docid") }, $i0]))), ("reduce"):
(fn(schema {
"docid"?: any
} keys, schema [
any ...
] values) ([{ (keys).*, ("values"):(values
-> transform each $ ({ ($).("attribute"), ($).("value"), ($).
("TF") })) }])), ("schema"):(system::const({
"key": schema {
"docid"?: any
},
"value": schema any
})), ("options"):(null) }),
system::mapReduce({ ("input"):(system::hdfs(system::strcat("hdfs/nb/
1c/", $signature, "/userClassAttrValueDefaultProb"))), ("map"):
(fn(schema [
any ...
] $mapIn) ($mapIn -> expand each u ( system::read($fd_0)
-> transform each d ({ (u).("userid"), (u).("class"), (d).("docid"),
("P"):((
group each $ in (
group each $join_in in ((u).("defaults")
-> transform each $defaults ({ ("defaults")?:($defaults) })) by
$join_on
= ([(($join_in).("defaults")).("attribute"), (($join_in).
("defaults")).("value")]) as $as_0, ((d).("values")
-> transform each $values ({ ("values")?:($values) })) by
([(($join_in).("values")).("attribute"), (($join_in).("values")).
("value")]) as $as_1 expand (

if( not (isnull ($join_on)) )
( system::nullElementOnEmpty($as_1) -> expand each $b ( $as_0
-> transform each $a ({ ($a).*, ($b).* }) )
)
else ( $as_0
-> transform each $a ({ ($a).* }) )
)

-> transform each $ ({ (($).("defaults")).("attribute"), (($).
("defaults")).("value"), ("P"):((($).("defaults")).("P")), ("TF"):
(system::firstNonNull((($).("values")).("TF"), (($).("defaults")).
("TF"))) })
-> transform each $ ({ ($).("attribute"), ($).("value"), ("P"):(
if( (($).("TF")) > (0) )
( ($).("P") )
else ( (1)-(($).("P")) )
) })) by $__unused__
= (null) as $ expand (
$
-> aggregate as $ full [ ( $
-> transform each $ (($).("P"))
-> system::sum() )]
-> transform each $ (system::index($, 0)))

-> system::singleton() )) }) )

-> transform each $fv ([null, $fv]))), ("schema"):(system::const({
"key": schema null,
"value": schema {
"userid"?: any,
"class"?: any,
"docid"?: any,
"P": any
}
})), ("output"):(system::const({
"location": "hdfs/nb/1c/genre.colorInfo.certificate.keyword/scores-
sum.bernoulli",
"type": "hdfs"
})) })
)


jaql>

Kevin Beyer

Jun 25, 2010, 10:27:11 PM
to Jaql Users
Jaql really needs better pretty-printing, but it is straightforward to
do it by hand if you have parenthesis matching in your editor of
choice. I reformatted the explain and added some comments to highlight
possible trouble:


jaql> explain $scores -> write(hdfs("hdfs/nb/1c/genre.colorInfo.certificate.keyword/scores-sum.bernoulli"));


(
$signature = ( system::const([
"genre",
"colorInfo",
"certificate",
"keyword"
])
-> system::combine(fn(a, b) (system::strcat(a, ".", b))) ),


$fd_0 = system::mapReduce({
("input"):(system::hdfs(system::strcat("hdfs/nb/1c/", $signature,
"/docAttrValueTF"))),
("output"):(system::HadoopTemp(schema=schema { "docid"?: any,
"values":
[ { "attribute"?: any, "value"?: any, "TF"?:
any } ... ] })),

("map"):(fn(schema [ any ... ] $mapIn0) (
$mapIn0
-> transform each $i0 ([{ ($i0).("docid") }, $i0]))),

//-----------------------------------------------------------------
// All the values for one group are forming a nested array,
// which can cause performance issues if the arrays are large.
// However, in this case, I believe that is exactly what you wanted
// to do, and the nested arrays are modest in size.
//-----------------------------------------------------------------


("reduce"): (fn(schema { "docid"?: any } keys, schema [ any ... ]
values) (
[{ (keys).*,
("values"): (values -> transform each $ ({ ($).
("attribute"), ($).("value"), ($). ("TF") })) }])),

("schema"):(system::const({ "key": schema { "docid"?: any },
"value": schema any })),
("options"):(null) }),


system::mapReduce({
("input"):(system::hdfs(system::strcat("hdfs/nb/1c/",
$signature, "/userClassAttrValueDefaultProb"))),

("map"): (fn(schema [ any ... ] $mapIn) (
$mapIn
-> expand each u (

//-----------------------------------------------------------------
// For each outer element, the entire inner file is re-read.
//-----------------------------------------------------------------

system::read($fd_0)
-> transform each d (
{ (u).("userid"),
(u).("class"),
(d).("docid"),
("P"):((

//-----------------------------------------------------------------
// For each pair of values [O(N^2)],
// a join is done [at least O(N^3), possibly O(N^4)]
// Of course these are different N, so it might be ok, or it
// might take a really long time. Is there some way to reuse
// computations over these O(N^2) joins?
//-----------------------------------------------------------------

group each $join_in in
( u.defaults -> transform each
$defaults { defaults?: $defaults })
by $join_on =
[ $join_in.defaults.attribute, $join_in.defaults.value ]
as $as_0,
( d.values -> transform each $values
{ values?: $values } )
by [ $join_in.values.attribute,
$join_in.values.value ]
as $as_1
expand (
if( not (isnull ($join_on)) ) (
system::nullElementOnEmpty($as_1)
-> expand each $b (
$as_0 -> transform each $a
({ ($a).*, ($b).* })
)
) else (
$as_0 -> transform each $a
({ ($a).* })
))

-> transform each $ {
$.defaults.attribute,
$.defaults.value,
P: $.defaults.P,
TF: system::firstNonNull($.values.TF,
$.defaults.TF) }
-> transform each $ {
$.attribute,
$.value,
P:( if( $.TF > 0 ) $.P else 1-$.P ) }
-> group each $ by $__unused__ = null as $
expand (
$ -> aggregate as $ full [ $ -> transform each $ $.P -> system::sum() ]
  -> transform each $ system::index($, 0)
-> system::singleton()
))
}))

-> transform each $fv [null, $fv]
)),