Database Selection for DataObjects and derivatives

146 views
Skip to first unread message

Guus Leeuw jr.

unread,
May 2, 2012, 3:22:02 AM5/2/12
to silverst...@googlegroups.com
Hello,

having looked at SilverStripe from a high level and done some tutorials
with various stages of stable and development releases, I come to one big
conclusion: SilverStripe immediately grabbed my attention in terms of the
backend CMS: neatly packaged, oversightful, easy to understand, no big
mistery how the page structure etc relates to frontend / published pages.
Even (for a techie) the theming mechanism is understandable from the
moment "Go".

So I've decided to delve into a bit more and start looking at my website's
architecture:
* Public Web
* Shop (with its own customer database, which we may keep separate yet)
* Non-Public Website that serves as a customer portal
* A CMS somewhere

Especially the non-public stuff (which is used to host internal (admin)
applications as well bug tracker, projects, timesheets etc applications)
would be an area of interest:

* It is highly role (authz) based, and login (authc) depends on an
internal database of username / password IDs.
* Applications tend to have their own database and REST interface for
other applications to pull data (e.g. a timesheet application pulls data
from the projects database to show only projects that the current user is
actually a noted team member of to record time against).

Leaving the authc and authz aside for the moment, I delved into the
database mechanism behind SilverStripe and came to the conclusion that
SilverStripe uses one database only, namely the main one identified during
installation, and noted $databaseConfig in mysite/_config.php.

What would you think of a method, built in to the framework, whereby each
DataObject and derivative can choose the database configuration it wants
to be stored in and gotten from? Much along the lines of Dependency
Injection, we can have a static $custom_database_connection (function /
singleton / variable) that is by default the "normal" (currently global)
database connection, and allows for overloading in descendants to suite
the descendants' need for its own database.

Allowing that sort of thing, would keep all the framework (in terms of
Model use by View and Controller) in tact, because each DataObject decides
in which database it wants to sit.

The database building (/dev/build) would only control the "normal"
database connection, so that SilverStripe only updates its own database.

The whole power of SilverStripe would be maintained, but it would be
feasible and easy to "pull in" data that sits in an external (to
SilverStripe) datastore. We can even dream up a file based datastore for
SilverStripe this way, because the only thing that would be necessary is
to create a descendant of SS_Database to handle the low level stuff of a
specialised database (like a file store).

Also, it would allow me to pull in (in the same manner as the current
security objects) authc and authz information from my current database
(rather than having to move all users to SilverStripe). My current auth
database handles and is used for: SSH logins, HR knowledge, email
(including forwarding rules), and employee / client logins to my
website(s) and the email infrastructure.
(Or more easily implement a MembershipProvider / RoleProvider /
ProfileProvider type structure akin to .NET...)

Any thoughts?

Best Regards,
Guus

Marcus Nyeholt

unread,
May 2, 2012, 10:05:08 PM5/2/12
to silverst...@googlegroups.com
Hi Guus,

At the moment most of the core of SS is very much single DB connection minded; most core classes uses the DB:: static method calls for database interaction, which all use a 'default' database connection. While it might be possible to manage the swapping around of what the 'defaut' connection is externally from the core, I'd imagine that it would only be successful for a limited number of scenarios. So the first hurdle to overcome would be to change core classes to use either instance methods, or make the DB:: static methods accept a different db connection identifier, either of which is non-trivial.

A second big problem with using different DB connections per object type is that the ORM can no longer manage related data properly (through $has_one, $many_many etc) as it'd be trying to join across databases, and if the type was inherited, all types in the hierarchy would need to be stored in the same DB as joins occur when loading inherited types. While there are ways around all of this (doing joins in code, etc), it's not ideal.

So in short, while an interesting idea, the amount of change required is pretty significant and not one that I'd want to undertake. 

However, back to the actual problem that it looks like you're trying to solve, because it's one I've frequently come across myself and have built an external content connector to help out https://github.com/nyeholt/silverstripe-external-content 

The idea is that you write an adaptor for the external content source, and a wrapper (or wrappers if needbe) for the external content items. The ExternalContentItem is a DataObject subclass, so can be used within SS much as you would use a data object. There's also mechanisms for importing the objects into a native SS data object representation - for example bringing through Wordpress content into SilverStripe. You could do something similar to wrap your external data sources

Some example connectors to get an idea of what's involved



Cheers,

Marcus


--
You received this message because you are subscribed to the Google Groups "SilverStripe Core Development" group.
To post to this group, send email to silverst...@googlegroups.com.
To unsubscribe from this group, send email to silverstripe-d...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/silverstripe-dev?hl=en.


Sam Minnée

unread,
May 3, 2012, 1:19:02 AM5/3/12
to silverst...@googlegroups.com
On 3/05/2012, at 2:05 PM, Marcus Nyeholt wrote:

Hi Guus,


At the moment most of the core of SS is very much single DB connection minded; most core classes uses the DB:: static method calls for database interaction, which all use a 'default' database connection. While it might be possible to manage the swapping around of what the 'defaut' connection is externally from the core, I'd imagine that it would only be successful for a limited number of scenarios. So the first hurdle to overcome would be to change core classes to use either instance methods, or make the DB:: static methods accept a different db connection identifier, either of which is non-trivial.

DataList could potentially have a dbConn property; this could be set to DB::getConn(), or to something else, with DI.
DataList::toArray() would need to $this->dbConn->query($this->dataQuery->sql().

The more significant change is to ensure that direct SQLQuery construction isn't used in the rest of core, but that should be achievable and is a worthwhile bit of housekeeping apart from this change.

A second big problem with using different DB connections per object type is that the ORM can no longer manage related data properly (through $has_one, $many_many etc) as it'd be trying to join across databases, and if the type was inherited, all types in the hierarchy would need to be stored in the same DB as joins occur when loading inherited types. While there are ways around all of this (doing joins in code, etc), it's not ideal.

The most straightforward solution would be to throw exception when you joined across different databases.  This would let you make use of multiple databases as long as you managed the partitioning yourself.  There are many situations where you might have a coherent but loosely coupled set of models running on another database, and being forced not to run joins across multiple databases would be an acceptable limitation.

Joining in PHP would be helpful in other contexts (e.g. linking a relational database to some kind of web API), but definitely a lot more work.  I would be inclined to approach its implementation from the perspective of "how can we execute these joins using only the SS_List API?"

However, back to the actual problem that it looks like you're trying to solve, because it's one I've frequently come across myself and have built an external content connector to help out https://github.com/nyeholt/silverstripe-external-content 

The idea is that you write an adaptor for the external content source, and a wrapper (or wrappers if needbe) for the external content items. The ExternalContentItem is a DataObject subclass, so can be used within SS much as you would use a data object. There's also mechanisms for importing the objects into a native SS data object representation - for example bringing through Wordpress content into SilverStripe. You could do something similar to wrap your external data sources

Some example connectors to get an idea of what's involved


Totally agree. :-)

Marcus Nyeholt

unread,
May 3, 2012, 1:35:09 AM5/3/12
to silverst...@googlegroups.com
Ah yes - I should have qualified my reply in context of 2.4, not 3.0, which I don't know as well internally.


There's probably a school of research out there about managing data joins across disparate data sets, which would probably be quite suited to a map/reduce type solution. Either way, not a trivial solution

Cheers,

Marcus

--

Guus Leeuw jr

unread,
May 3, 2012, 4:04:55 AM5/3/12
to silverst...@googlegroups.com

Hello,

 

To clarify one thing right up front: I speak of data store as opposed to data base: the data store can be anything, whereas a data base is generally some form of SQL. The distinction is in the fact that I see my idea as a general store-agnostic way of accessing data in SilverStripe… That is to say, one could come up with: SalesforceDatabase, LdapDatabase, MySQLDatabase, PostgressDatabase, RESTDatabase, SOAPDatabase, SharePointDatabase but also (see later) MySQL_LdapDatabase, or MySQL_RESTDatabase. (We might need to abstract SS_Database into SS_Datastore, but we’ll cross that bridge when we get there.)

 

Looking at DataList, the dbConn that Sam proposed should be more of a function, so that it can pull $this->dataClass->$db_connection(). DataObject::db_connection() then provides a two element array: Type and Config, which by default is array(‘Type’ => ‘MySQL’, ‘Config’ => $databaseConfig). DataList then understands that ‘Type’Database is the class / object to do the querying and *it* in turn understands that ‘Config’ tells it how to connect to the data store.

 

The only area that is indeed untouched by this scheme is direct querying through SQLQuery. Since SQLQuery is such a specialised class and case, I would not worry so much about its mechanism to connect to the right data store. However, the core needs to move away from direct SQLQuery usage *or* we need a factory design pattern which one can pool connections to data stores and make them accessible much like DataModel makes a collection of DataLists accessible. (Connecting to a data store is always probably the most expensive (in terms of time) operation, so using the same data store multiple times must happen without incurring that connection cost more than once.) If we pool connections like that, the two-element array in DataObject::db_connection() becomes a three-element array: array(‘Name’ => ‘ProjectDB’, ‘Type’ => ‘MySQL’, ‘Config’ => array(‘Host’ => ‘mysql.example.com:3307’, ‘User’ => ‘projects’, ‘Pass’ => ‘projects’, ‘Schema’ => ‘project’)), so that the DataConnections can __get and __set this connection information, create the appropriate class on first __set, make sure it is connected to the data store, and set that object within  the connection array passed through DataList.

 

So we now solved the part one problems of Marcus’ initial reply. (other than it still is a big chunk of work J)

 

On to the second problem: Intra-data-store joins. Agree with Sam that we can probably throw an Exception saying: Intra-Data-Store Join is not possible. However, it would be much more elegant if we could somehow entice the driver of the join (i.e. the outer most class) to join the inner-next class’s data with itself / or the inner most class to join the outer-next class’s data with itself. Need to think about this though, as it will incur some sort of big-time performance penalty (think a huge number of rows)… Databases themselves are obviously better at joining than PHP. But, aha, the SS_Database abstraction we just clarified above, could construe a MySQL_LdapDatabase class that takes care of joins between the two (and is a specialised class handling specialised data store requirements based on whatever the users’ needs are, so SS does not have to care about it at all, since all SS data always comes from one data store). The way to go about that would be to have a DataObject with a DualName, i.e. StaffCommission where Staff comes from LDAP, and Commission is a derivative of Salesforce.com and its db_connection is the LDAP_SalesforceDatabase that knows how to calculate Commission (sales volume) for each Sales person (staff) on a monthly basis by meshing the two distinct data stores… Solutions are endless with this approach… one can then even dream of specialised external data modules that provide this sort of glue…

 

Solved that too, then by leaving special cases to people who need them, but provide the glue in SS to enable people to create these special cases.

 

Last item that would be of interest, again, is /dev/build, whereby that process only takes care of DataObject descendants that actually use the ‘Default’ data store. (‘Default’ is now the name of the three-way array used in DataConnections from the first part of the solution), and created (in mysite/_config.php) by the installer, much like it is now.

 

I’m not scared of the amount of work as such. I am, however, not really nimble around git just yet, but given time that will surely come. (I found http://doc.silverstripe.org/sapphire/en/trunk/misc/contributing)

 

More thoughts?

 

Regards,

Guus

--

Sam Minnée

unread,
May 3, 2012, 6:52:37 AM5/3/12
to silverst...@googlegroups.com, <silverstripe-dev@googlegroups.com>
Guus, I suggest you spend a bit more time reading up on SS_List, DataList, and DataQuery. Much of the abstraction you speak of already exists. DataList is intended to be specific to relational databases; SS_List is more broadly applicable to datastores. Facilities such as GridField are built to rely on the SS_List interface.

Sam Minnée

unread,
May 3, 2012, 10:04:09 PM5/3/12
to silverst...@googlegroups.com
On 3/05/2012, at 8:04 PM, Guus Leeuw jr wrote:

Hello,
 

To clarify one thing right up front: I speak of data store as opposed to data base: the data store can be anything, whereas a data base is generally some form of SQL. The distinction is in the fact that I see my idea as a general store-agnostic way of accessing data in SilverStripe… That is to say, one could come up with: SalesforceDatabase, LdapDatabase, MySQLDatabase, PostgressDatabase, RESTDatabase, SOAPDatabase, SharePointDatabase but also (see later) MySQL_LdapDatabase, or MySQL_RESTDatabase. (We might need to abstract SS_Database into SS_Datastore, but we’ll cross that bridge when we get there.)

DataList, with the help of DataQuery, is designed to connect to relational databases using SQL.  One level of abstraction that could probably be done by coming up with new DataQuery implementations (or refactoring the existing one) would be to allow connecting to relational databases that weren't built according to SilverStripe's specifications with dev/build.  However, I wouldn't try and wedge non-RDBMSes into the Database layer, as most of the code that's included at that level is designed to deal with SQL-specific stuff.

We have a higher abstraction that is designed to do what you want, and we have make this level of abstraction at SS_List, and its companion interfaces, SS_Sortable, SS_Filterable, and SS_Limitable.  You can implement all of the interfaces that your data source supports.  So, Restful API connectors that expose themselves as SS_List are exactly what's intended.  Right now, you would need to write your own static get method on the class in question, so that this would operate as intended:

SomeRestfulObject::get()->filter("Field" => "Value")->count()

You might build this for a single API, or you might make a more generic system that can be configured to connect to multiple APIs that follow the same conventions.

One project that I would particularly like to see is an SS_List based interconnection with Facebook, such that I can drop my status updates or friends list into a GridField.

Looking at DataList, the dbConn that Sam proposed should be more of a function, so that it can pull $this->dataClass->$db_connection(). DataObject::db_connection() then provides a two element array: Type and Config, which by default is array(‘Type’ => ‘MySQL’, ‘Config’ => $databaseConfig). DataList then understands that ‘Type’Database is the class / object to do the querying and *it* in turn understands that ‘Config’ tells it how to connect to the data store. 

No.  $dbConn is an object.  An object has a class and it has properties.  Using a 2 element array like this seems to be trying to ignore the existence of OO.
Rather than a parameter it might, as you say, make sense to refer to an accessor method, getDbConn(), but I'd be inclined to leave this up to the conventions of the oft-discussed DI system that's "comin' along real soon now"(TM).

On to the second problem: Intra-data-store joins. Agree with Sam that we can probably throw an Exception saying: Intra-Data-Store Join is not possible. However, it would be much more elegant if we could somehow entice the driver of the join (i.e. the outer most class) to join the inner-next class’s data with itself / or the inner most class to join the outer-next class’s data with itself. Need to think about this though, as it will incur some sort of big-time performance penalty (think a huge number of rows)… Databases themselves are obviously better at joining than PHP. But, aha, the SS_Database abstraction we just clarified above, could construe a MySQL_LdapDatabase class that takes care of joins between the two (and is a specialised class handling specialised data store requirements based on whatever the users’ needs are, so SS does not have to care about it at all, since all SS data always comes from one data store). The way to go about that would be to have a DataObject with a DualName, i.e. StaffCommission where Staff comes from LDAP, and Commission is a derivative of Salesforce.com and its db_connection is the LDAP_SalesforceDatabase that knows how to calculate Commission (sales volume) for each Sales person (staff) on a monthly basis by meshing the two distinct data stores… Solutions are endless with this approach… one can then even dream of specialised external data modules that provide this sort of glue…

This is really hard to do well.  But, here are some pointers based on the current codebase:

 * Right now, most of the "core" relation linking happens with dot-syntax in DataList::filter() calls.  So, you might say SiteTree::get()->filter(array("Parent.Title" => "Blog")) to get all the children of pages entitled Blog.
 * The code responsible for creating the necessary joins is DataQuery::applyRelation().  It has a lot of knowledge of the specific database structure of DataObjects baked into it.
 * One the relation is applied (i.e., the relevant join clauses are added), SearchFilter::getDbName() returns a field expression that can be used to build a where clause fragment.

But let's imagine that we're filtering on a relation that's not in the database.  Let's say each Member has a FacebookFriends() method that returns friends via the facebook API.  We want to return all SS Members who have a Facebook friend called Sam:

Member::get()->filter(array("FacebookFriends.FirstName" => "Sam"))

There are a number of ways of doing this.  The simplest approach would probably be to do this:

 * Return each member
 * For each member, call $member->FacebookFriends()->filter(array("FirstName" => "Sam"))->count()
 * If that returns a number greater than zero, include the member in the resulting list.

However, that could be extremely slow - imagine if you had 100,000 members!

Perhaps you could do this:

 * Get the Facebook IDs of all the SS Members.
 * call Facebook's friend search API, search for FirstName = Sam AND Facebook ID IN (list from first step).
 * filter the member list by the resulting list of Facebook IDs.

However, any approach like this is going to require much more knowledge about what can and can't be done with the specific API.

This is the most challenging part of implementing this kind of system: integration between different APIs is going to require algorithms specifically designed for the query in question, and general-purpose query language rapidly becomes an AI problem.

The best approach is likely to to be.  For example, if it were possible to query a whole collection of related records at once, that would allow for an optimisation described in the 2nd paragraph.  Something like this:

Member::get()->relation("FacebookFriends")->filter("FirstName" => "Sam")->column("MemberID")

Right now, the above method works if Member::FacebookFriends() returns a RelationList implementing the forForeignId() method.

But, not every API will be able to have this implemented on it. There will need to be introspection APIs available, to describe which filtering mechanisms this APIs have available, and the joining logic will need to see what's available and take the best approach it can.

Possible, but not something for the faint of heart.

Last item that would be of interest, again, is /dev/build, whereby that process only takes care of DataObject descendants that actually use the ‘Default’ data store. (‘Default’ is now the name of the three-way array used in DataConnections from the first part of the solution), and created (in mysite/_config.php) by the installer, much like it is now. 

dev/build just calls requireTable() and requireDefaultRecords() on each DataObject subclass.  It would be fairly straightforward to get this supporting multiple database.
Reply all
Reply to author
Forward
0 new messages