Hi everyone, I found browsing the sensorbase db that there's no foreign key constraint neither over the sensordata table's column 'sdt' or column 'owner'. In this way I could insert sdt names that are not stored in the sensordatatype table's column 'name', and I could insert owner names that are not stored in the hackyuser table's column 'email'. In fact my own installed sensors have already sent sdt not stored in the sdt table:
9 rows selected ij> select distinct name from sensordatatype; NAME ---------------------------------------------------------------- Commit TestSdt UnitTest
3 rows selected
I'd like to add these constraints to create a proper mapping (from relational db schema to rdf schema) file. I'm going to check for alternatives in order to avoid modifying the db schema (currently I've not yet checked), but in the mean time I'd like to know if the lack of these constraints has a particular reason behind or if you agree with the addition of them.
Thanks a lot in advance for any answer cheers myriam
I'm not that familiar with what you are doing, but I'm not entirely sure
that you want to always populate your RDF with the hackystat database
records. I think we have found (austen can help confirm this) that directly
accessing the database is a little slow. its a relational db but it isn't
very normalized. (well in our case we had lots of data so maybe this might
not be your operational environment)
What is your RDF structure going to look like? Rachel, another GSoCoder is
putting together a SocialMedia SDT that is pretty similar to a RDF triple;
see
http://syntacticsugah.blogspot.com/2009/06/gsoc-hackystat-may-25-june....
Actually what she is doing is pretty much RDF... I think there are some
overlaps there.
back to my concern. is your graph going to be persisted? i'm thinking if
you are creating a graph of the entire hackystat database you might want to
hook into the actual writing of the data to the database and just write to
your graph simutaneously. dumping the data from the derby database into the
rdf database might work initially, but i'm not certain thats the long term
solution.
i'm a little clueless about what you are working on and your plans. but i
will try to keep up with what you are doing. i've worked a little with RDF.
so things like dublin core rings a bell. another thing thats interesting is
your queries against the RDF database... from my experience as the graph
grows you can't possibly query the graph in a deterministic way if you want
a constant response time (i could be wrong of course). queries might have
to be stochastic.
anyway, sounds like cool stuff. i'll be reading your vision document one of
these days...
thanks, aaron
On Wed, Jun 3, 2009 at 11:25 PM, myriam leggieri
<myriam.leggi...@gmail.com>wrote:
> Hi everyone,
> I found browsing the sensorbase db that there's no foreign key constraint
> neither over the sensordata table's column 'sdt' or column 'owner'. In this
> way I could insert sdt names that are not stored in the sensordatatype
> table's column 'name', and I could insert owner names that are not stored in
> the hackyuser table's column 'email'. In fact my own installed sensors have
> already sent sdt not stored in the sdt table:
> 9 rows selected
> ij> select distinct name from sensordatatype;
> NAME
> ----------------------------------------------------------------
> Commit
> TestSdt
> UnitTest
> 3 rows selected
> I'd like to add these constraints to create a proper mapping (from
> relational db schema to rdf schema) file. I'm going to check for
> alternatives in order to avoid modifying the db schema (currently I've not
> yet checked), but in the mean time I'd like to know if the lack of these
> constraints has a particular reason behind or if you agree with the addition
> of them.
> Thanks a lot in advance for any answer
> cheers
> myriam
> I'm not that familiar with what you are doing, but I'm not entirely sure
> that you want to always populate your RDF with the hackystat database
> records. I think we have found (austen can help confirm this) that directly
> accessing the database is a little slow. its a relational db but it isn't
> very normalized. (well in our case we had lots of data so maybe this might
> not be your operational environment)
I made an investigation over relational and native triple-stores because (as
I stated at:
http://groups.google.com/group/hackystat-dev/browse_thread/thread/656...)
there are two ways to obtain a system optimized for the storage and
retrieval of RDF triples:
- store triples physically in a database. This is the so-called native
triple-storage system.
- store triple content in a relational database and use a "relational
database to RDF wrapper" which rewrites SPARQL queries into SQL queries
against an application-specific relational schemata based on a mapping
(which could be automatic (a restrictive method), manual or partially
manual). This is the so-called relational triple-storage system.
The result of that investigation has been linked in that discussion, too (
http://hackystat-linked-sensor-data.googlecode.com/files/hackystatLin...)
and from benchmark results it seems that relational triple-stores have
better performance than the native ones, and that between the relational
triple-stores the better is the D2RQ platform. That's why I'm using it and
I've generated an automatic mapping of the sensorbase db as a start point.
> What is your RDF structure going to look like? Rachel, another GSoCoder is
> putting together a SocialMedia SDT that is pretty similar to a RDF triple;
> see
> http://syntacticsugah.blogspot.com/2009/06/gsoc-hackystat-may-25-june....
> Actually what she is doing is pretty much RDF... I think there are some
> overlaps there.
I linked a first draft of the RDF schema at
http://groups.google.com/group/hackystat-dev/browse_thread/thread/656....
Maybe that discussion is so generic that the other developers stopped
being interested in it...I'm sorry.
With regards to the Rachel's work about her new 'SocialMedia' SDT, it seems
to me a sort of RDF wrapper of Ohloh, facebook, twitter and all the others
interesting materials not already exposed as RDF. But of course, I don't
know enough her project tasks and probably she is intending to do something
else. However it's not a limit to my project..I'll prevent the declaration
of new SDTs.
> back to my concern. is your graph going to be persisted? i'm thinking if
> you are creating a graph of the entire hackystat database you might want to
> hook into the actual writing of the data to the database and just write to
> your graph simutaneously. dumping the data from the derby database into the
> rdf database might work initially, but i'm not certain thats the long term
> solution.
The graph is not going to be persisted. The D2RQ Platform that I'm using has
been created for accessing non-RDF, relational databases as virtual
read-only RDF graphs. The D2RQ Engine includes a graph that wraps the
relational database into a virtual read-only RDF graph. It requires a
mapping file (which is customizable using the D2RQ Mapping Language) to
rewrite Jena API calls to SQL queries against the database and passes query
results up to the higher layers of the frameworks. The mapping file
describes the relation between an ontology and a relational data model.
The 'dump' to which I referred, is only a trick to get automatically a RDF
schema over the (always automatically generated) mapping file. They're both
automatically generated and then surely they represent just a start point.
In fact currently I've already begun modifying the mapping file to make it
gradually adhere to the "planned first draft of RDF Schema" (just linked
above). In particular the RDF Schema (actually automatically generated
through a 'dump-to-rdf' script) will be modified to follow the Linked data
principles and in the planned RDF schema are already contained possible
links to external existing vocabularies. Through these links lots of
external additional info will be available; while I need a mapping file to
properly access the relational db when info about raw sensordata is
required. Of course I could just use the sensorbaseclient to get this info,
but in this case I have to rewrite on my own Jena API calls and SPARQL
queries. Moreover the D2RQ Platform provides the D2R-Server which allows RDF
and HTML browsers to navigate the content of the database, and allows
applications to query the database using the SPARQL query language
(functionalities that would have been hard to realize using only the Jena
toolkit and the sensorbaseclient).
> i'm a little clueless about what you are working on and your plans. but i
> will try to keep up with what you are doing. i've worked a little with RDF.
> so things like dublin core rings a bell.
Thanks a lot for you interest! ;)
> another thing thats interesting is your queries against the RDF database...
> from my experience as the graph grows you can't possibly query the graph in
> a deterministic way if you want a constant response time (i could be wrong
> of course). queries might have to be stochastic.
I made that investigation and searched benchmarks because I imagine that the
amount of sensor data would be very large and then performance matters. I
hope that this will be enough...
> On Wed, Jun 3, 2009 at 11:25 PM, myriam leggieri <
> myriam.leggi...@gmail.com> wrote:
>> Hi everyone,
>> I found browsing the sensorbase db that there's no foreign key constraint
>> neither over the sensordata table's column 'sdt' or column 'owner'. In this
>> way I could insert sdt names that are not stored in the sensordatatype
>> table's column 'name', and I could insert owner names that are not stored in
>> the hackyuser table's column 'email'. In fact my own installed sensors have
>> already sent sdt not stored in the sdt table:
>> 9 rows selected
>> ij> select distinct name from sensordatatype;
>> NAME
>> ----------------------------------------------------------------
>> Commit
>> TestSdt
>> UnitTest
>> 3 rows selected
>> I'd like to add these constraints to create a proper mapping (from
>> relational db schema to rdf schema) file. I'm going to check for
>> alternatives in order to avoid modifying the db schema (currently I've not
>> yet checked), but in the mean time I'd like to know if the lack of these
>> constraints has a particular reason behind or if you agree with the addition
>> of them.
>> Thanks a lot in advance for any answer
>> cheers
>> myriam
> I'd like to add these constraints to create a proper mapping (from relational db schema > to rdf schema) file. I'm going to check for alternatives in order to avoid modifying > the db schema (currently I've not yet checked), but in the mean time I'd like to know > if the lack of these constraints has a particular reason behind or if you agree with > the addition of them.
Hi Myriam,
First off, I'm quite happy to see that you have figured this out! You are clearly making headway into understanding things "under the hood".
My answer is: The utterly denormalized nature of the default back-end for the sensorbase is entirely intentional. However, it is not a hard architectural constraint of the system; I think the Postgres implementation uses foreign keys (Aaron or Austen can correct me if I'm wrong).
The back story on the design of the default back-end is as follows: in Hackystat Versions 1 to 7, although we didn't have a relational DB back-end, we did maintain a fairly "normalized" repository. Specifically, we defined sensor data types with a set of required and optional fields, and we required all sensor instances of that type to have all required fields and no other fields that were not optional. In some sense, the system was "strongly typed". This worked great for about three weeks or so until we failed to specify an SDT correctly but didn't realize this until after we had already sent some data using the "wrong" structure. What to do? Our answer, around the time of Hackystat Version 6, and at considerable expense in time and complexity, was to create the ability to "evolve" sensor data types dynamically: you could attach "hooks" to the sensor data type definition that would dynamically update instances of that type upon retrieval (or upon arrival from a sensor that had not been updated with the new definition).
This was, if I might say so myself, a neat hack, but when it came time for Version 8 and an almost complete redesign of the system, I asked myself if I wanted to go down that road again, and myself answered me with a resounding "No way". It felt to me, frankly, that the cost of maintaining structural consistency in Version 7- was all out of proportion to the benefits, and I would rather rely on test cases and analysis code that was robust in the presence of non-conformant instances. (This is kind of similar to those Pythonistas who bawk at the type-level verbosity of Java.)
So, in Version 8, the SDT field is a simple String, which may or may not correspond to the name of a defined SDT. The SDT definition is only a "hint" or "recommendation" regarding the structure of sensor data instances that use that name. Some of the benefits of this approach are:
* You can send some instances of a brand new SDT (say, "Foo") and play around with its structure for a while until it "settles down" before defining the SDT.
* You can change the SDT associated with an instance easily.
* You can always enforce "strong typing" at the REST API level if you need it. I have always thought it would be good to have an option called "compliant=true" in the REST API sensor data retrieval functions. If the client provides that, then the sensorbase would only return those sensor data instances whose structure conformed to the current SDT definition. (This, in fact, was my design "escape hatch" in case the idea of loosey goosey typing turned out to be a horrible mistake.)
That said, let me revisit one of your comments:
> I'd like to add these constraints to create a proper mapping (from relational db schema > to rdf schema) file.
I do not think it is a good idea to create such a mapping, because the only "public" API for Hackystat services is the REST API. In the case of the Sensorbase, the relational DB schema is an "implementation detail", and in fact can be different depending upon which back-end is installed. So far, there have been three different back-end implementations for the sensorbase: a Derby back-end, a Postgres back-end, and a SQL server back-end, and I suspect each of them have different schemas.
Now, you might feel like this leaves you up a creek without a paddle, since if the SDTs can change arbitrarily, then how can you define an RDF? That is a good question, and what I suggest you do is to write an RDF that does not depend upon relational DB schemas, but rather is based upon "assumptions" about the structure of sensors of a given type. Analyses using the RDF have to discard sensor data that does not obey your assumptions.
Now, my hope is that as you develop your RDF and the analyses based upon them, you're going to come up with new constraints you would like to impose on the existing sensor data. In other words, I hope that by the end of the summer, you will be sending out emails that say things like, "Look guys, if you would please ensure that your Commit sensor data instances would include a field specifying the type of CM system (GIT vs. SVN vs. Mercurial vs. ClearCase vs. Perforce), then I could write a totally rocking analysis." That is how Hackystat has always evolved and gotten better: someone starts working at a high level, and as a result discovers changes to be made to the plumbing to make the system do new and cool things.
--On Thursday, June 04, 2009 12:07 AM -1000 Aaron Kagawa <kaga...@gmail.com> wrote:
> Greetings Myriam,
> I'm not that familiar with what you are doing, but I'm not entirely sure that you want > to always populate your RDF with the hackystat database records.
This is a good question: what is the actual relationship of the RDF to the sensor data instances (and project instances, and SDT instances, and user instances)?
I guess I don't yet get how you intend this to work. Are you intending to attach an RDF description to each instance in the sensorbase? (And, of course, to each instance in the DailyProjectData service, and each instance in the Telemetry service, and even analyses in the Software ICU, because these higher level services are where the more semantically interesting information occurs)?
If so, that seems worrisome to me, not the least because DPD and Telemetry instances are derived data and there can be a _lot_ of them.
I guess I was conceptualizing the RDF kind of existing off in its own space and generating queries to the various Hackystat services or something.
> back to my concern. is your graph going to be persisted? i'm thinking if you are > creating a graph of the entire hackystat database you might want to hook into the > actual writing of the data to the database and just write to your graph simutaneously.
Hmm. Again, I think we need to think in terms of "services", not "sensorbase".
One idea that occurs to me is to enable individual project owners to "subscribe" to Myriam's "Super Semantic RDF Service". This is the same approach we are using in Devcathlon. In Myriam's case, she would write a new Hackystat service that the owner of a Hackystat project would login to using their Hackystat credentials and indicate that a specified project would be available for RDF analysis. Now Myriam's service would have the project, the owner, and the owner's password, which would enable the service to go off and retrieve any/all sensor data associated with this project, as well as any DPD, Telemetry, Software ICU, Tickertape, etc. information of interest.
This pretty much gets around the issues of having to "dump" an entire database, and also finesses the security/privacy issues since the owner of a project has to "opt in" to the analysis.
The downside is that you don't get access to all of the (sensorbase) data immediately. I don't know if this is a bug or a feature.
> I think we have found (austen can help confirm this) that directly
accessing the database is a little slow. its a relational db but it
isn't very normalized. (well in our case we had lots of > data so
maybe this might not be your operational environment)
It was a while back, but I think we were trying to do some complex
queries so the performance suffered since the database wasn't
normalized. There was also an issue with data integrity. For
example, sensor data could exist for deleted users.
> My answer is: The utterly denormalized nature of the default back- end for the sensorbase
> is entirely intentional. However, it is not a hard architectural
constraint of the
> system; I think the Postgres implementation uses foreign keys
(Aaron or Austen can
> correct me if I'm wrong).
> --On Thursday, June 04, 2009 12:07 AM -1000 Aaron Kagawa <kaga...@gmail.com > > wrote:
>> Greetings Myriam,
>> I'm not that familiar with what you are doing, but I'm not entirely
>> sure that you want
>> to always populate your RDF with the hackystat database records.
> This is a good question: what is the actual relationship of the RDF
> to the sensor data
> instances (and project instances, and SDT instances, and user
> instances)?
> I guess I don't yet get how you intend this to work. Are you
> intending to attach an RDF
> description to each instance in the sensorbase? (And, of course, to
> each instance in the
> DailyProjectData service, and each instance in the Telemetry
> service, and even analyses
> in the Software ICU, because these higher level services are where
> the more semantically
> interesting information occurs)?
> If so, that seems worrisome to me, not the least because DPD and
> Telemetry instances are
> derived data and there can be a _lot_ of them.
> I guess I was conceptualizing the RDF kind of existing off in its
> own space and
> generating queries to the various Hackystat services or something.
>> back to my concern. is your graph going to be persisted? i'm
>> thinking if you are
>> creating a graph of the entire hackystat database you might want to
>> hook into the
>> actual writing of the data to the database and just write to your
>> graph simutaneously.
> Hmm. Again, I think we need to think in terms of "services", not
> "sensorbase".
> One idea that occurs to me is to enable individual project owners to
> "subscribe" to
> Myriam's "Super Semantic RDF Service". This is the same approach we
> are using in
> Devcathlon. In Myriam's case, she would write a new Hackystat
> service that the owner of
> a Hackystat project would login to using their Hackystat credentials
> and indicate that a
> specified project would be available for RDF analysis. Now Myriam's
> service would have
> the project, the owner, and the owner's password, which would enable
> the service to go
> off and retrieve any/all sensor data associated with this project,
> as well as any DPD,
> Telemetry, Software ICU, Tickertape, etc. information of interest.
> This pretty much gets around the issues of having to "dump" an
> entire database, and also
> finesses the security/privacy issues since the owner of a project
> has to "opt in" to the
> analysis.
> The downside is that you don't get access to all of the (sensorbase)
> data immediately. I
> don't know if this is a bug or a feature.
Hi Philip, it seems I was wrong while thinking to start from data stored in the sensorbasae db: thanks a lot for having warned me. Regarding the new fields potentially useful if added to data sent from sensors, there are already a lots of them thanks to some external vocabularies: I'll just have to filter the most useful of them ;)
That said, I have three choices to persist the RDF model: 1 in-memory 2 relational db 3 filesystem
the 2th and 3th choices are excluded because of the reasons that you gave: the db schema could change and there would bee too many DPD and Telemetry instances to be stored. Considering that
> I guess I was conceptualizing the RDF kind of existing off in its own space > and > generating queries to the various Hackystat services or something.
> the in-memory storage consists in querying the Hackystat services and
inserting the query results into different statements (which is a triple of kind 'subject-predicate-object') and then add them to a run-time created model. The model would be repopulated from scratch each time the application was launched. This is of course, really time and resource-consuming especially on large amounts of data.
The only solution I see consists in a hybrid one, using the cache and the in-memory mechanisms; and is described below. Initially is provided to users only a really high level graph, showing only the class hierarchy. Then a user could choice to get a detailed view for only the particular parts of the graph that - requires data already added to a cached model and - requires to query only one Hackystat service. The query result is then added to a model that is cached in a way similar to the DPD and Telemetry caching mechanism.
This would limit the time and resource costs.
This is a good question: what is the actual relationship of the RDF to the
> sensor data > instances (and project instances, and SDT instances, and user instances)?
I've created a sample project which illustrates the RDF representation of Hackystat project and user data, and only 'Code Issue' sensor data. In other words the application: - gets project, user and sensor data having 'CodeIssue' as type, through the sensorBaseClient; - adds them to a run-time created model (using also external vocabularies as indicated in the schema) within triples. The model is initialized with a schema. - lists all the statements - lists only the statements having RDF:type as predicate and CodeIssue as object - lists only the statements having RDF:type as predicate and CodeQuality as object (this requires a reasoner to properly access the hierarchy through the subClassOf statements). (see the class ModelConstructionExample.java: http://code.google.com/p/hackystat-linked-sensor-data/source/browse/t... )
To be even more clear I'll split the RDF schema already created in smaller (and then less complex) sections. Obviously I will also complete the sample application according to your opinion about the idea previously described. What do you think about it?