Attributed data sources

Justin Richer

Jun 17, 2011, 3:55:45 PM
to portable...@googlegroups.com
We're working on a small project that would collect profile information
about a person from multiple data sources and rebroadcast it as a
single profile. These datasources would be listed in a subsumptive
order, such that a higher-priority source's value takes priority over a
lower-priority source's value. For these examples, let's call the higher
priority source "primary" and the lower priority one "secondary", with
this info:

Primary:
{
  displayName: "Harold R. Smith",
  emails: [ { type: "work", value: "hsm...@example.com" } ],
  phoneNumbers: [ { type: "work", value: "123-456-7890" } ]
}

Secondary:
{
  displayName: "Bob Smith",
  phoneNumbers: [ { type: "work", value: "9-1-312-1234" } ]
}

The "simple" requested form of this profile would be a basic PoCo object
that contains the aggregated or overshadowed data:

{
  displayName: "Harold R. Smith",
  emails: [ { type: "work", value: "hsm...@example.com" } ],
  phoneNumbers: [ { type: "work", value: "123-456-7890" },
                  { type: "work", value: "9-1-312-1234" } ]
}

We're also going to let users request this same combined form with
different datasource orderings, so with "secondary" listed first:

{
  displayName: "Bob Smith",
  emails: [ { type: "work", value: "hsm...@example.com" } ],
  phoneNumbers: [ { type: "work", value: "9-1-312-1234" },
                  { type: "work", value: "123-456-7890" } ]
}
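
As a rough sketch of the combining rule described above: singular fields
take the value from the first source that has one, and plural fields are
concatenated in source order. The function name mergeProfiles is made up
for illustration and is not part of PoCo; the sketch also assumes that
every plural field arrives as an array and does no duplicate elimination.

```javascript
// Hypothetical helper (not part of PoCo): merge profiles in priority order.
// `profiles` is ordered from highest to lowest priority.
function mergeProfiles(profiles) {
  const merged = {};
  for (const profile of profiles) {
    for (const [name, value] of Object.entries(profile)) {
      if (Array.isArray(value)) {
        // Plural field: aggregate entries, higher-priority sources first.
        merged[name] = (merged[name] || []).concat(value);
      } else if (!(name in merged)) {
        // Singular field: the first (highest-priority) source wins.
        merged[name] = value;
      }
    }
  }
  return merged;
}

const primary = {
  displayName: "Harold R. Smith",
  emails: [{ type: "work", value: "hsm...@example.com" }],
  phoneNumbers: [{ type: "work", value: "123-456-7890" }],
};

const secondary = {
  displayName: "Bob Smith",
  phoneNumbers: [{ type: "work", value: "9-1-312-1234" }],
};

// Same data, two orderings.
console.log(mergeProfiles([primary, secondary]).displayName);
console.log(mergeProfiles([secondary, primary]).displayName);
```

Swapping the order of the array is all it takes to get the
"secondary first" view.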

Now here's where things get tricky. We'd like to give people a more
in-depth full-profile view that expresses all of the metadata associated
with a datasource, such as the name of the source, the update timestamp,
and things like that. Furthermore, say you wanted to get *all*
displayNames for a person, along with their data sources. This isn't too
bad when you look at the plural fields, since we can just add a
"source:" member to them. Even the complex fields aren't overly tricky
for the same reason, but there's the problem of repeating a member name
in an object (which I'm pretty sure will break a lot of parsers). The
problem gets even worse with the simple singular fields, like
displayName above.

To express something like this, we'd almost certainly be returning
something that isn't really PoCo, and that's fine with us. But we'd like
to get it to stick somewhat close to the original if we can. I've been
mulling this over and have come up with three possible solutions, none
of which I'm particularly happy about, so I wanted to throw it back to
the PoCo community and see if anyone else has been thinking about
something like this.

First, we could have a parallel data structure named "sources" that
simply iterates the source attribute for each component, as:

{
  displayName: "Harold R. Smith",
  emails: [ { type: "work", value: "hsm...@example.com" } ],
  phoneNumbers: [ { type: "work", value: "123-456-7890" },
                  { type: "work", value: "9-1-312-1234" } ],

  sources: {
    displayName: "primary",
    emails: "primary",
    phoneNumbers: ["primary", "secondary"]
  }
}

This is a hacky way to get the attribution in while leaving the main
structure alone. It doesn't let us do things like have a list of
displayNames, though.
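
For illustration, here is a minimal sketch of how a server might build
that parallel structure while merging. The name mergeWithSources is
hypothetical. Unlike the example above, this sketch always emits an array
of source names for plural fields (so emails becomes ["primary"]), which
keeps each entry index-aligned with the merged list. It also assumes no
source profile has a field of its own called "sources".

```javascript
// Hypothetical helper: merge in priority order and record, in a parallel
// "sources" member, which source each value came from.
// `namedProfiles` is an array of [sourceName, profile] pairs,
// ordered from highest to lowest priority.
function mergeWithSources(namedProfiles) {
  const merged = {};
  const sources = {};
  for (const [sourceName, profile] of namedProfiles) {
    for (const [name, value] of Object.entries(profile)) {
      if (Array.isArray(value)) {
        merged[name] = (merged[name] || []).concat(value);
        // One source name per entry, index-aligned with merged[name].
        sources[name] = (sources[name] || []).concat(
          value.map(() => sourceName)
        );
      } else if (!(name in merged)) {
        merged[name] = value;
        sources[name] = sourceName;
      }
    }
  }
  return { ...merged, sources };
}

const attributed = mergeWithSources([
  ["primary", {
    displayName: "Harold R. Smith",
    emails: [{ type: "work", value: "hsm...@example.com" }],
    phoneNumbers: [{ type: "work", value: "123-456-7890" }],
  }],
  ["secondary", {
    displayName: "Bob Smith",
    phoneNumbers: [{ type: "work", value: "9-1-312-1234" }],
  }],
]);

console.log(JSON.stringify(attributed.sources));
```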

Second, we could just extend all simple values with a multi-part data
structure, extend all plural values with a source attribute (and others),
and wrap complex values just like the simple values:

{
  displayName: [ { value: "Harold R. Smith", source: "primary" } ],
  emails: [ { type: "work", value: "hsm...@example.com", source: "primary" } ],
  phoneNumbers: [ { type: "work", value: "123-456-7890", source: "primary" },
                  { type: "work", value: "9-1-312-1234", source: "secondary" } ]
}

This totally breaks assumptions on what's living at the end of each
member name by throwing arrays of objects where strings once were, but
it keeps the base structure the same.
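
A sketch of this second approach, assuming plural entries are always
objects. The name attributeAll is hypothetical. Because singular values
get wrapped rather than overshadowed, every source's displayName
survives, which is the "all displayNames" case mentioned earlier.

```javascript
// Hypothetical helper: turn every member into a list of attributed entries.
// `namedProfiles` is an array of [sourceName, profile] pairs,
// ordered from highest to lowest priority.
function attributeAll(namedProfiles) {
  const result = {};
  for (const [sourceName, profile] of namedProfiles) {
    for (const [name, value] of Object.entries(profile)) {
      const entries = Array.isArray(value)
        // Plural field: add a "source" member to each existing entry.
        ? value.map((entry) => ({ ...entry, source: sourceName }))
        // Singular field (simple or complex): wrap it as { value, source }.
        : [{ value, source: sourceName }];
      result[name] = (result[name] || []).concat(entries);
    }
  }
  return result;
}

const full = attributeAll([
  ["primary", {
    displayName: "Harold R. Smith",
    emails: [{ type: "work", value: "hsm...@example.com" }],
    phoneNumbers: [{ type: "work", value: "123-456-7890" }],
  }],
  ["secondary", {
    displayName: "Bob Smith",
    phoneNumbers: [{ type: "work", value: "9-1-312-1234" }],
  }],
]);

console.log(JSON.stringify(full.displayName));
```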

Third, we could skip the server-side aggregation in this case and just
return the individual PoCo objects, as:

{
  primary:
  {
    displayName: "Harold R. Smith",
    emails: [ { type: "work", value: "hsm...@example.com" } ],
    phoneNumbers: [ { type: "work", value: "123-456-7890" } ]
  },
  secondary:
  {
    displayName: "Bob Smith",
    phoneNumbers: [ { type: "work", value: "9-1-312-1234" } ]
  }
}

This basically puts all the metadata in the envelope, but it prevents
the server from smushing things together in any intelligent way (such as
eliminating duplicates).

I'm still very open to suggestions.
-- Justin

Joseph Smarr

Jun 17, 2011, 6:05:31 PM
to portable...@googlegroups.com
Justin-thanks for the question, this sounds like a useful app! :) I think the first decision you need to make is whether you want the output to be a single valid contact (e.g. only one displayName) + extra metadata about where it came from, OR all of the data, even though it's not a valid contact (e.g. multiple displayNames). Maybe what you really want is a hybrid of the two--a single aggregated contact (perhaps with source info), followed by a raw contact for each source, provided for reference. That way the user could just take the "best guess" if they wanted, and the source data would tell them where each field came from, and they could fall back on the separate per-source raw contacts if they wanted more info or to let the user pick which field to keep etc. See what i mean?

BTW I think that separate sources object is a good way to go...we've often had people ask things like "how do i add extra metadata to singular fields like gender, since they're just key/value pairs", and the standard answer is to just invent a new field to hold that metadata. That way you don't get in the way of normal use, but it's still clear and easy how to find and stitch together that extra metadata.

Hope this helps, lemme know if I can help answer any other questions! js


--
You received this message because you are subscribed to the Google Groups "PortableContacts" group.
To post to this group, send email to portable...@googlegroups.com.
To unsubscribe from this group, send email to portablecontac...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/portablecontacts?hl=en.


Richer, Justin P.

Jun 19, 2011, 7:45:13 PM
to portable...@googlegroups.com
I'm really not a fan of the parallel data arrays approach in general. Lots of stitching required on the client side and it makes the structure more fragile, in my experience. We could do something like keep the combined object in the root and put a sources object full of also-valid objects indexed by the source name as in the last example. Still mulling this over though, and we'll post our schema back out. If there's interest in the app, we could try to release it as open source, too.
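
Roughly, that shape could be built like this. It is a sketch only:
buildHybrid and the member name "sources" are placeholders, and nothing
here is a settled schema. The root stays a valid combined contact, and
each raw per-source contact is kept untouched under its source name.

```javascript
// Hypothetical helper: combined contact at the root, raw contacts under
// "sources". `namedProfiles` is an array of [sourceName, profile] pairs,
// ordered from highest to lowest priority.
function buildHybrid(namedProfiles) {
  const combined = {};
  const sources = {};
  for (const [sourceName, profile] of namedProfiles) {
    sources[sourceName] = profile; // raw contact, kept as-is for reference
    for (const [name, value] of Object.entries(profile)) {
      if (Array.isArray(value)) {
        combined[name] = (combined[name] || []).concat(value);
      } else if (!(name in combined)) {
        combined[name] = value; // highest-priority singular value wins
      }
    }
  }
  return { ...combined, sources };
}

const hybrid = buildHybrid([
  ["primary", {
    displayName: "Harold R. Smith",
    emails: [{ type: "work", value: "hsm...@example.com" }],
    phoneNumbers: [{ type: "work", value: "123-456-7890" }],
  }],
  ["secondary", {
    displayName: "Bob Smith",
    phoneNumbers: [{ type: "work", value: "9-1-312-1234" }],
  }],
]);

// A plain client reads the root; a curious one digs into hybrid.sources.
console.log(hybrid.displayName, "/", hybrid.sources.secondary.displayName);
```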

-- Justin

Justin Richer

Jul 5, 2011, 4:33:14 PM
to portable...@googlegroups.com
Just had another thought on this after seeing some chatter on the
OpenID/AB list. What if we were to use hash extensions to the member
names to mark sources for simple values and use a "source" attribute for
complex objects, such as:

{
  "displayName#primary": "Harold R. Smith",
  "displayName#secondary": "Bob Smith",
  emails: [ { type: "work", value: "hsm...@example.com", source: "primary" } ],
  phoneNumbers: [ { type: "work", value: "123-456-7890", source: "primary" },
                  { type: "work", value: "9-1-312-1234", source: "secondary" } ]
}

If we wanted to keep wire compatibility with PoCo, we could even add in
the plain "displayName" member, chosen using the same subsumptive rules
described earlier in this thread.

Thoughts on this solution?
-- Justin

Joseph Smarr

Jul 5, 2011, 7:30:52 PM
to portable...@googlegroups.com
Do those turn into valid JavaScript property names? If so, that's kind of elegant--basically just a very deterministic way to define metadata for singular fields that ends up effectively turning the singular field into an object with key/value properties. In any case, yes I'd definitely still keep a plain "displayName" so normal clients can do something sensible with it.

Justin Richer

Jul 6, 2011, 9:59:23 AM
to portable...@googlegroups.com
You don't get to use dot notation because of the #, but you can use
bracket (array accessor) notation. Thus:

p.displayName#primary

fails in JavaScript, but:

p["displayName#primary"]

works just fine.

This also gets around the single-member-name restriction, since
"displayName#primary" and "displayName#secondary" are unique strings as
far as JS is concerned.

A colleague of mine has pointed out, though, that this does encode data
in the keys, and you don't get the kind of free parser stuff that you
would with normal JSON objects. That is, you need to do special
construction and splitting of the keys to get information about the
values. I still like this approach, but that's a known downside.
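
To make that cost concrete, here is a sketch of the key splitting a
client would need. The helper names splitKey and collectAttributed are
invented for this example, and it assumes that neither field names nor
source names contain a "#" of their own.

```javascript
// Hypothetical helper: split "displayName#primary" into its two parts.
// Keys without a "#" are unadorned and have no source.
function splitKey(key) {
  const i = key.indexOf("#");
  return i === -1
    ? { name: key, source: null }
    : { name: key.slice(0, i), source: key.slice(i + 1) };
}

// Hypothetical helper: gather every attributed value, grouped by field name.
function collectAttributed(profile) {
  const out = {};
  for (const [key, value] of Object.entries(profile)) {
    const { name, source } = splitKey(key);
    if (source === null) continue; // unadorned value, nothing to attribute
    (out[name] = out[name] || []).push({ value, source });
  }
  return out;
}

const p = {
  displayName: "Harold R. Smith",
  "displayName#primary": "Harold R. Smith",
  "displayName#secondary": "Bob Smith",
};

console.log(p["displayName#secondary"]);
console.log(JSON.stringify(collectAttributed(p)));
```

Clients that only care about the plain displayName never have to run
any of this, which is the main appeal of keeping the unadorned member.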


I'm therefore proposing the following deterministic transformation rules
for metadata (in particular, data source in our case) on PoCo's three
kinds of values:

Simple values:

For source "src", append "#src" to the name of the element and insert
it into the parent object alongside the unadorned value. An unadorned
value SHOULD be included to comply with the base specification.

Complex singular:

For source "src", append "#src" to the name of the element and insert
it into the parent object alongside the unadorned value. An unadorned
value SHOULD be included to comply with the base specification.
[Additionally, a field of "source" with value "src" MAY be added to
this element.]

Complex plural:

For source "src", add a field of "source" with value "src" to each
element in the plural object list. The root element (such as "emails")
remains unadorned.

PoCo does not define plural simple fields (i.e., simple lists of strings),
so we leave that case undefined here as well.
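
The three rules above can be sketched as one transformation. This is an
illustration under assumptions, not a reference implementation: the name
applySourceRules is made up, the optional "source" field on complex
singular values is included, and the unadorned value is picked by the
same priority order used earlier in the thread. The "name" member in the
sample data is added here only to exercise the complex singular case.

```javascript
// Hypothetical helper applying the proposed rules.
// `namedProfiles` is an array of [src, profile] pairs,
// ordered from highest to lowest priority.
function applySourceRules(namedProfiles) {
  const out = {};
  for (const [src, profile] of namedProfiles) {
    for (const [name, value] of Object.entries(profile)) {
      if (Array.isArray(value)) {
        // Complex plural: tag each element (assumed to be an object);
        // the root member name stays unadorned.
        out[name] = (out[name] || []).concat(
          value.map((element) => ({ ...element, source: src }))
        );
        continue;
      }
      const isComplex = value !== null && typeof value === "object";
      // Simple or complex singular: insert "name#src" next to the plain name.
      // Complex singular values also get the optional "source" field.
      out[name + "#" + src] = isComplex ? { ...value, source: src } : value;
      // Unadorned value: the highest-priority source wins.
      if (!(name in out)) out[name] = value;
    }
  }
  return out;
}

const transformed = applySourceRules([
  ["primary", {
    displayName: "Harold R. Smith",
    name: { formatted: "Harold R. Smith" }, // illustrative complex singular
    emails: [{ type: "work", value: "hsm...@example.com" }],
    phoneNumbers: [{ type: "work", value: "123-456-7890" }],
  }],
  ["secondary", {
    displayName: "Bob Smith",
    phoneNumbers: [{ type: "work", value: "9-1-312-1234" }],
  }],
]);

console.log(Object.keys(transformed).join(", "));
```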


-- Justin
