Gmail Calendar Documents Reader Web more »
Recently Visited Groups | Help | Sign in
Google Groups Home
database foreign key constraints question
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  9 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
myriam leggieri  
View profile  
 More options Jun 4 2009, 5:25 am
From: myriam leggieri <myriam.leggi...@gmail.com>
Date: Wed, 3 Jun 2009 23:25:29 -1000
Local: Thurs, Jun 4 2009 5:25 am
Subject: database foreign key constraints question

Hi everyone,
I found browsing the sensorbase db that there's no foreign key constraint
neither over the sensordata table's column 'sdt' or column 'owner'. In this
way I could insert sdt names that are not stored in the sensordatatype
table's column 'name', and I could insert owner names that are not stored in
the hackyuser table's column 'email'. In fact my own installed sensors have
already sent sdt not stored in the sdt table:

ij> select distinct sdt from sensordata;
SDT
----------------------------------------------------------------
SampleSdt
TestSdt
UnitTest
Coupling
Coverage
Commit
CodeIssue
Build
FileMetric

9 rows selected
ij> select distinct name from sensordatatype;
NAME
----------------------------------------------------------------
Commit
TestSdt
UnitTest

3 rows selected

I'd like to add these constraints to create a proper mapping (from
relational db schema to rdf schema) file. I'm going to check for
alternatives in order to avoid modifying the db schema (currently I've not
yet checked), but in the mean time I'd like to know if the lack of these
constraints has a particular reason behind or if you agree with the addition
of them.

Thanks a lot in advance for any answer
cheers
myriam


    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Aaron Kagawa  
View profile  
 More options Jun 4 2009, 6:07 am
From: Aaron Kagawa <kaga...@gmail.com>
Date: Thu, 4 Jun 2009 00:07:06 -1000
Local: Thurs, Jun 4 2009 6:07 am
Subject: Re: [hackystat-dev] database foreign key constraints question

Greetings Myriam,

I'm not that familiar with what you are doing, but I'm not entirely sure
that you want to always populate your RDF with the hackystat database
records. I think we have found (austen can help confirm this) that directly
accessing the database is a little slow. its a relational db but it isn't
very normalized. (well in our case we had lots of data so maybe this might
not be your operational environment)

What is your RDF structure going to look like?  Rachel, another GSoCoder is
putting together a SocialMedia SDT that is pretty similar to a RDF triple;
see
http://syntacticsugah.blogspot.com/2009/06/gsoc-hackystat-may-25-june....
Actually what she is doing is pretty much RDF... I think there are some
overlaps there.

back to my concern.   is your graph going to be persisted?  i'm thinking if
you are creating a graph of the entire hackystat database you might want to
hook into the actual writing of the data to the database and just write to
your graph simutaneously. dumping the data from the derby database into the
rdf database might work initially, but i'm not certain thats the long term
solution.

i'm a little clueless about what you are working on and your plans. but i
will try to keep up with what you are doing. i've worked a little with RDF.
so things like dublin core rings a bell. another thing thats interesting is
your queries against the RDF database... from my experience as the graph
grows you can't possibly query the graph in a deterministic way if you want
a constant response time (i could be wrong of course).  queries might have
to be stochastic.

anyway, sounds like cool stuff.  i'll be reading your vision document one of
these days...

thanks, aaron

On Wed, Jun 3, 2009 at 11:25 PM, myriam leggieri
<myriam.leggi...@gmail.com>wrote:


    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
myriam leggieri  
View profile  
 More options Jun 4 2009, 9:12 am
From: myriam leggieri <myriam.leggi...@gmail.com>
Date: Thu, 4 Jun 2009 03:12:48 -1000
Local: Thurs, Jun 4 2009 9:12 am
Subject: Re: [hackystat-dev] Re: database foreign key constraints question

Hi Aaron,
thanks a lot for your answer.

> I'm not that familiar with what you are doing, but I'm not entirely sure
> that you want to always populate your RDF with the hackystat database
> records. I think we have found (austen can help confirm this) that directly
> accessing the database is a little slow. its a relational db but it isn't
> very normalized. (well in our case we had lots of data so maybe this might
> not be your operational environment)

I made an investigation over relational and native triple-stores because (as
I stated at:
http://groups.google.com/group/hackystat-dev/browse_thread/thread/656...)
there are two ways to obtain a system optimized for the storage and
retrieval of RDF triples:
- store triples physically in a database. This is the so-called native
triple-storage system.
- store triple content in a relational database and use a "relational
database to RDF wrapper" which rewrites SPARQL queries into SQL queries
against an application-specific relational schemata based on a mapping
(which could be automatic (a restrictive method), manual or partially
manual). This is the so-called relational triple-storage system.

The result of that investigation has been linked in that discussion, too (
http://hackystat-linked-sensor-data.googlecode.com/files/hackystatLin...)
and from benchmark results it seems that relational triple-stores have
better performance than the native ones, and that between the relational
triple-stores the better is the D2RQ platform. That's why I'm using it and
I've generated an automatic mapping of the sensorbase db as a start point.

> What is your RDF structure going to look like?  Rachel, another GSoCoder is
> putting together a SocialMedia SDT that is pretty similar to a RDF triple;
> see
> http://syntacticsugah.blogspot.com/2009/06/gsoc-hackystat-may-25-june....
> Actually what she is doing is pretty much RDF... I think there are some
> overlaps there.

I linked a first draft of the RDF schema at
http://groups.google.com/group/hackystat-dev/browse_thread/thread/656....
Maybe that discussion is so generic that the other developers stopped
being interested in it...I'm sorry.
With regards to the Rachel's work about her new 'SocialMedia' SDT, it seems
to me a sort of RDF wrapper of Ohloh, facebook, twitter and all the others
interesting materials not already exposed as RDF. But of course, I don't
know enough her project tasks and probably she is intending to do something
else. However it's not a limit to my project..I'll prevent the declaration
of new SDTs.

> back to my concern.   is your graph going to be persisted?  i'm thinking if
> you are creating a graph of the entire hackystat database you might want to
> hook into the actual writing of the data to the database and just write to
> your graph simutaneously. dumping the data from the derby database into the
> rdf database might work initially, but i'm not certain thats the long term
> solution.

The graph is not going to be persisted. The D2RQ Platform that I'm using has
been created for accessing non-RDF, relational databases as virtual
read-only RDF graphs. The D2RQ Engine includes a graph that wraps the
relational database into a virtual read-only RDF graph. It requires a
mapping file (which is customizable using the D2RQ Mapping Language) to
rewrite Jena API calls to SQL queries against the database and passes query
results up to the higher layers of the frameworks. The mapping file
describes the relation between an ontology and a relational data model.
The 'dump' to which I referred, is only a trick to get automatically a RDF
schema over the (always automatically generated) mapping file. They're both
automatically generated and then surely they represent just a start point.
In fact currently I've already begun modifying the mapping file to make it
gradually adhere to the "planned first draft of RDF Schema" (just linked
above). In particular the RDF Schema (actually automatically generated
through a 'dump-to-rdf' script) will be modified to follow the Linked data
principles and in the planned RDF schema are already contained possible
links to external existing vocabularies. Through these links lots of
external additional info will be available; while I need a mapping file to
properly access the relational db when info about raw sensordata is
required. Of course I could just use the sensorbaseclient to get this info,
but in this case I have to rewrite on my own Jena API calls and SPARQL
queries. Moreover the D2RQ Platform provides the D2R-Server which allows RDF
and HTML browsers to navigate the content of the database, and allows
applications to query the database using the SPARQL query language
(functionalities that would have been hard to realize using only the Jena
toolkit and the sensorbaseclient).

> i'm a little clueless about what you are working on and your plans. but i
> will try to keep up with what you are doing. i've worked a little with RDF.
> so things like dublin core rings a bell.

Thanks a lot for you interest! ;)

> another thing thats interesting is your queries against the RDF database...
> from my experience as the graph grows you can't possibly query the graph in
> a deterministic way if you want a constant response time (i could be wrong
> of course).  queries might have to be stochastic.

I made that investigation and searched benchmarks because I imagine that the
amount of sensor data would be very large and then performance matters. I
hope that this will be enough...

> anyway, sounds like cool stuff.

Thanks a lot. I'll do my best ;)
cheers
myriam


    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
myriam leggieri  
View profile  
 More options Jun 4 2009, 9:16 am
From: myriam leggieri <myriam.leggi...@gmail.com>
Date: Thu, 4 Jun 2009 03:16:10 -1000
Local: Thurs, Jun 4 2009 9:16 am
Subject: Re: [hackystat-dev] Re: database foreign key constraints question

> intending to do something else. However it's not a limit to my
> project..I'll prevent the declaration of new SDTs.

.... sorry I meant "I'll foresee the declaration of new SDTs" :p

    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Philip Johnson  
View profile  
 More options Jun 4 2009, 7:07 pm
From: Philip Johnson <john...@hawaii.edu>
Date: Thu, 04 Jun 2009 13:07:16 -1000
Local: Thurs, Jun 4 2009 7:07 pm
Subject: Re: [hackystat-dev] database foreign key constraints question

--On Wednesday, June 03, 2009 11:25 PM -1000 myriam leggieri <myriam.leggi...@gmail.com>
wrote:

> I'd like to add these constraints to create a proper mapping (from relational db schema
> to rdf schema) file. I'm going to check for alternatives in order to avoid modifying
> the db schema (currently I've not yet checked), but in the mean time I'd like to know
> if the lack of these constraints has a particular reason behind or if you agree with
> the addition of them.

Hi Myriam,

First off, I'm quite happy to see that you have figured this out!  You are clearly making
headway into understanding things "under the hood".

My answer is: The utterly denormalized nature of the default back-end for the sensorbase
is entirely intentional. However, it is not a hard architectural constraint of the
system; I think the Postgres implementation uses foreign keys (Aaron or Austen can
correct me if I'm wrong).

The back story on the design of the default back-end is as follows: in Hackystat Versions
1 to 7, although we didn't have a relational DB back-end, we did maintain a fairly
"normalized" repository.  Specifically, we defined sensor data types with a set of
required and optional fields, and we required all sensor instances of that type to have
all required fields and no other fields that were not optional.  In some sense, the
system was "strongly typed".  This worked great for about three weeks or so until we
failed to specify an SDT correctly but didn't realize this until after we had already
sent some data using the "wrong" structure.  What to do?  Our answer, around the time of
Hackystat Version 6, and at considerable expense in time and complexity, was to create
the ability to "evolve" sensor data types dynamically: you could attach "hooks" to the
sensor data type definition that would dynamically update instances of that type upon
retrieval (or upon arrival from a sensor that had not been updated with the new
definition).

This was, if I might say so myself, a neat hack, but when it came time for Version 8 and
an almost complete redesign of the system, I asked myself if I wanted to go down that
road again, and myself answered me with a resounding "No way".  It felt to me, frankly,
that the cost of maintaining structural consistency in Version 7- was all out of
proportion to the benefits, and I would rather rely on test cases and analysis code that
was robust in the presence of non-conformant instances.  (This is kind of similar to
those Pythonistas who bawk at the type-level verbosity of Java.)

So, in Version 8, the SDT field is a simple String, which may or may not correspond to
the name of a defined SDT.  The SDT definition is only a "hint" or "recommendation"
regarding the structure of sensor data instances that use that name.   Some of the
benefits of this approach are:

* You can send some instances of a brand new SDT (say, "Foo") and play around with its
structure for a while until it "settles down" before defining the SDT.

* You can change the SDT associated with an instance easily.

* You can always enforce "strong typing" at the REST API level if you need it.  I have
always thought it would be good to have an option called "compliant=true" in the REST API
sensor data retrieval functions.  If the client provides that, then the sensorbase would
only return those sensor data instances whose structure conformed to the current SDT
definition.  (This, in fact, was my design "escape hatch" in case the idea of loosey
goosey typing turned out to be a horrible mistake.)

That said, let me revisit one of your comments:

> I'd like to add these constraints to create a proper mapping (from relational db schema
> to rdf schema) file.

I do not think it is a good idea to create such a mapping, because the only "public" API
for Hackystat services is the REST API.  In the case of the Sensorbase, the relational DB
schema is an "implementation detail", and in fact can be different depending upon which
back-end is installed. So far, there have been three different back-end implementations
for the sensorbase: a Derby back-end, a Postgres back-end, and a SQL server back-end, and
I suspect each of them have different schemas.

Now, you might feel like this leaves you up a creek without a paddle, since if the SDTs
can change arbitrarily, then how can you define an RDF? That is a good question, and what
I suggest you do is to write an RDF that does not depend upon relational DB schemas, but
rather is based upon "assumptions" about the structure of sensors of a given type.
Analyses using the RDF have to discard sensor data that does not obey your assumptions.

Now, my hope is that as you develop your RDF and the analyses based upon them, you're
going to come up with new constraints you would like to impose on the existing sensor
data.  In other words, I hope that by the end of the summer, you will be sending out
emails that say things like, "Look guys, if you would please ensure that your Commit
sensor data instances would include a field specifying the type of CM system (GIT vs. SVN
vs. Mercurial vs. ClearCase vs. Perforce), then I could write a totally rocking
analysis."   That is how Hackystat has always evolved and gotten better:  someone starts
working at a high level, and as a result discovers changes to be made to the plumbing to
make the system do new and cool things.

Philip


    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Philip Johnson  
View profile  
 More options Jun 4 2009, 7:26 pm
From: Philip Johnson <john...@hawaii.edu>
Date: Thu, 04 Jun 2009 13:26:10 -1000
Local: Thurs, Jun 4 2009 7:26 pm
Subject: Re: [hackystat-dev] Re: database foreign key constraints question

--On Thursday, June 04, 2009 12:07 AM -1000 Aaron Kagawa <kaga...@gmail.com> wrote:

> Greetings Myriam,

> I'm not that familiar with what you are doing, but I'm not entirely sure that you want
> to always populate your RDF with the hackystat database records.

This is a good question: what is the actual relationship of the RDF to the sensor data
instances (and project instances, and SDT instances, and user instances)?

I guess I don't yet get how you intend this to work.  Are you intending to attach an RDF
description to each instance in the sensorbase? (And, of course, to each instance in the
DailyProjectData service, and each instance in the Telemetry service, and even analyses
in the Software ICU, because these higher level services are where the more semantically
interesting information occurs)?

If so, that seems worrisome to me, not the least because DPD and Telemetry instances are
derived data and there can be a _lot_ of them.

I guess I was conceptualizing the RDF kind of existing off in its own space and
generating queries to the various Hackystat services or something.

> back to my concern.   is your graph going to be persisted?  i'm thinking if you are
> creating a graph of the entire hackystat database you might want to hook into the
> actual writing of the data to the database and just write to your graph simutaneously.

Hmm.  Again, I think we need to think in terms of "services", not "sensorbase".

One idea that occurs to me is to enable individual project owners to "subscribe" to
Myriam's "Super Semantic RDF Service".  This is the same approach we are using in
Devcathlon.  In Myriam's case, she would write a new Hackystat service that the owner of
a Hackystat project would login to using their Hackystat credentials and indicate that a
specified project would be available for RDF analysis.  Now Myriam's service would have
the project, the owner, and the owner's password, which would enable the service to go
off and retrieve any/all sensor data associated with this project, as well as any DPD,
Telemetry, Software ICU, Tickertape, etc. information of interest.

This pretty much gets around the issues of having to "dump" an entire database, and also
finesses the security/privacy issues since the owner of a project has to "opt in" to the
analysis.

The downside is that you don't get access to all of the (sensorbase) data immediately.  I
don't know if this is a bug or a feature.

Philip


    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Austen Ito  
View profile  
 More options Jun 5 2009, 3:25 am
From: Austen Ito <austen....@gmail.com>
Date: Thu, 4 Jun 2009 21:25:59 -1000
Local: Fri, Jun 5 2009 3:25 am
Subject: Re: [hackystat-dev] Re: database foreign key constraints question
 > I think we have found (austen can help confirm this) that directly  
accessing the database is a little slow. its a relational db but it  
isn't very normalized. (well in our case we had lots of > data so  
maybe this might not be your operational environment)

It was a while back, but I think we were trying to do some complex  
queries so the performance suffered since the database wasn't  
normalized.  There was also an issue with data integrity.  For  
example, sensor data could exist for deleted users.

 > My answer is: The utterly denormalized nature of the default back-
end for the sensorbase
 > is entirely intentional. However, it is not a hard architectural  
constraint of the
 > system; I think the Postgres implementation uses foreign keys  
(Aaron or Austen can
 > correct me if I'm wrong).

Yes they do.  The hackystat-sensorbase-postgres wiki has a schema  
image: http://code.google.com/p/hackystat-sensorbase-postgres/wiki/DatabaseS...

austen

On Jun 4, 2009, at 1:26 PM, Philip Johnson wrote:


    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
myriam leggieri  
View profile  
 More options Jun 8 2009, 1:06 pm
From: myriam leggieri <myriam.leggi...@gmail.com>
Date: Mon, 8 Jun 2009 07:06:03 -1000
Local: Mon, Jun 8 2009 1:06 pm
Subject: Re: [hackystat-dev] Re: database foreign key constraints question

Hi Philip,
it seems I was wrong while thinking to start from data stored in the
sensorbasae db: thanks a lot for having warned me. Regarding the new fields
potentially useful if added to data sent from sensors, there are already a
lots of them thanks to some external vocabularies: I'll just have to filter
the most useful of them ;)

That said, I have three choices to persist the RDF model:
1 in-memory
2 relational db
3 filesystem

the 2th and 3th choices are excluded because of the reasons that you gave:
the db schema could change and there would bee too many DPD and Telemetry
instances to be stored.
Considering that

> I guess I was conceptualizing the RDF kind of existing off in its own space
> and
> generating queries to the various Hackystat services or something.

> the in-memory storage consists in querying the Hackystat services and

inserting the query results into different statements (which is a triple of
kind 'subject-predicate-object') and then add them to a run-time created
model. The model would be repopulated from scratch each time the application
was launched. This is of course, really time and resource-consuming
especially on large amounts of data.

The only solution I see consists in a hybrid one, using the cache and the
in-memory mechanisms; and is described below.
Initially is provided to users only a really high level graph, showing only
the class hierarchy. Then a user could choice to get a detailed view for
only the particular parts of the graph that
- requires data already added to a cached model
and
- requires to query only one Hackystat service. The query result is then
added to a model that is cached in a way similar to the DPD and Telemetry
caching mechanism.

This would limit the time and resource costs.

This is a good question: what is the actual relationship of the RDF to the

> sensor data
> instances (and project instances, and SDT instances, and user instances)?

I've created a sample project which illustrates the RDF representation of
Hackystat project and user data, and  only 'Code Issue' sensor data. In
other words the application:
- gets project, user and sensor data having 'CodeIssue' as type, through the
sensorBaseClient;
- adds them to a run-time created model (using also external vocabularies as
indicated in the schema) within triples. The model is initialized with a
schema.
- lists all the statements
- lists only the statements having RDF:type as predicate and CodeIssue as
object
- lists only the statements having RDF:type as predicate and CodeQuality as
object (this requires a reasoner to properly access the hierarchy through
the subClassOf statements).
(see the class ModelConstructionExample.java:
http://code.google.com/p/hackystat-linked-sensor-data/source/browse/t...
)

To be even more clear I'll split the RDF schema already created in smaller
(and then less complex) sections. Obviously I will also complete the sample
application according to your opinion about the idea previously described.
What do you think about it?


    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Philip Johnson  
View profile  
 More options Jun 8 2009, 3:42 pm
From: Philip Johnson <john...@hawaii.edu>
Date: Mon, 08 Jun 2009 09:42:34 -1000
Local: Mon, Jun 8 2009 3:42 pm
Subject: Re: [hackystat-dev] Re: database foreign key constraints question

--On Monday, June 08, 2009 7:06 AM -1000 myriam leggieri <myriam.leggi...@gmail.com>
wrote:

> What do you think about it?

Sounds great, go for it!

Philip


    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2010 Google