Re: Can web2py be used for highly complex relational databases?


Cliff Kachinske

Dec 28, 2012, 7:31:11 PM12/28/12
to web...@googlegroups.com
My advice would be to start with the database back end you will ultimately use. 

With 10,000 tables you would have to explore a way to avoid running the model files with every request.  There are posts in this forum that explain how to do it.



On Friday, December 28, 2012 10:38:25 AM UTC-5, Alex Glaros wrote:
Can web2py be used for highly complex relational databases for large fiscal projects? Example: California's Fi$cal project - http://www.fiscal.ca.gov/ - with roughly 10,000 tables and many complex joins.

What components of web2py would start to get slow or not work well when having so many tables? 

If web2py would instead be better used to prototype the Fi$cal system, what would be good production-version candidates to migrate to? Pure Python using SQLAlchemy? Java? Anything that would make migration easier, such as Python-based frameworks?

Thanks,

Alex Glaros

VP

Dec 28, 2012, 7:57:41 PM12/28/12
to
What is the level of interdependency in this application? I bet you can factor this project into dozens of different apps, which communicate with each other through APIs if needed. It's hard to think of an app with 10,000 tables that are all interdependent.

Massimo Di Pierro

Dec 28, 2012, 11:39:51 PM12/28/12
to web...@googlegroups.com
I have some experience with a large PeopleSoft system which claims thousands of tables. It turns out almost everything I needed was in fewer than 10 tables.

Basically, most of the tables were lookup tables, used simply as a key:value store for the possible values of a field. For example, a table GENDER {'M':'Male','F':'Female','O':'Other'}.

So 100 tables with 100 columns each, plus one lookup table for each column, gives you more than 10,000 tables. I suspect that is your case too.

All systems that claim an insane number of tables belong to this category.

The way to handle it is to load all the lookup tables into cache and use the cache, instead of database access, to convert key<->value. In fact, the values in lookup tables almost never change.
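That caching trick can be sketched in plain Python (a minimal, framework-agnostic sketch; in web2py you would use cache.ram rather than lru_cache, and the gender table here is just an illustration):

```python
import sqlite3
from functools import lru_cache

# In-memory stand-in for the real database and one lookup table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gender (code TEXT PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO gender VALUES (?, ?)",
                 [("M", "Male"), ("F", "Female"), ("O", "Other")])

@lru_cache(maxsize=None)          # lookup values almost never change,
def lookup(table):                # so cache the whole table forever
    rows = conn.execute("SELECT code, label FROM %s" % table).fetchall()
    return dict(rows)

# key -> value without touching the database after the first call
print(lookup("gender")["M"])      # -> Male
```

In web2py the same idea would be something like cache.ram('gender', load_gender, time_expire=None), invalidated by hand on the rare occasion a lookup value changes.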

I suggest that before you embark on this venture you do the following exercise: make a list of all table names; for each table, list the fields it contains and count its records (more or less).

You will find many tables with fewer than 100 records and fewer than 10 columns, and a few tables with more than 10 columns and more than 100,000 records. You need to find out how many tables belong to each category.

If this is the case, as I suspect, then you can use web2py, but you need to set up some clever caching system to handle the lookup tables. It would be the same with other frameworks, since you don't want to join everything all the time or your database will grind to a halt.

It is also possible I am completely wrong in my assumption.

In the case of the PeopleSoft system I studied, they were also storing all past versions of each record in the same table as the current record. Basically, every record had two dates (valid_from, valid_until). Current records had valid_until set to 2999-12-31. Records were never modified in place. The process for modifying a record consisted of creating a copy of the current record, editing the copy, setting valid_until=now on the previously current record, and updating all references pointing to the record. Of course all tables used the same mechanism for versioning, which makes the update process very slow and cumbersome, and all tables unnecessarily large. Yet it simplifies auditing, because you can go back to any moment in time simply by filtering records in a query.
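The valid_from/valid_until scheme can be sketched in a few lines of plain Python (table and field names here are illustrative, not the PeopleSoft schema):

```python
from datetime import date

END_OF_TIME = date(2999, 12, 31)   # sentinel marking the current version

# every row carries its validity interval; rows are never edited in place
records = [
    {"key": "ACC1", "name": "Old name",
     "valid_from": date(2010, 1, 1), "valid_until": END_OF_TIME},
]

def modify(records, key, changes, today):
    """'Edit' = retire the current row and append an amended copy."""
    current = next(r for r in records
                   if r["key"] == key and r["valid_until"] == END_OF_TIME)
    current["valid_until"] = today                 # close the old version
    records.append(dict(current, valid_from=today,
                        valid_until=END_OF_TIME, **changes))

def as_of(records, key, when):
    """Audit query: the version that was valid at a given moment."""
    return next(r for r in records
                if r["key"] == key
                and r["valid_from"] <= when < r["valid_until"])

modify(records, "ACC1", {"name": "New name"}, date(2012, 6, 1))
print(as_of(records, "ACC1", date(2011, 1, 1))["name"])  # -> Old name
print(as_of(records, "ACC1", date(2013, 1, 1))["name"])  # -> New name
```

The audit benefit Massimo mentions is the as_of query: one filter recovers the state of the data at any past date.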

The reason I am explaining all of this is that you are probably going to have to deal with something like this. The problem is not web2py vs. another framework. The problem will be that you need special logic to handle those tables, logic which is foreign to web2py and to many modern frameworks, which simply assume more modern database design practices.

My suggestion is to start small and see what happens. Find out who your primary target users are, find which tables they need to access, and create a web interface for those tables. You will probably be able to factor the interaction with the database into many small apps.

Massimo

Lazarof

Dec 29, 2012, 3:28:07 AM12/29/12
to web...@googlegroups.com
Hello Alex.
Could you comment on what Massimo explained? Is that the situation?
I bet it is.


On Friday, December 28, 2012 at 7:38:25 PM UTC+4, Alex Glaros wrote:

Massimo Di Pierro

Dec 29, 2012, 11:47:42 AM12/29/12
to web...@googlegroups.com
I should add that in the system I looked at, almost all tables were de-normalized. Every table storing a key would also store the value corresponding to that code. This is for two reasons: 1) if you have thousands of tables, you cannot join everything all the time; 2) for auditing purposes, it should be possible to change a value in a lookup table without changing the corresponding values in records created before.

This means that while I needed the lookup tables for the IS_IN_SET() and IS_IN_DB() validators, I never needed more than two joins.

Again, the whole design seems crazy to many of us, but it has its plus sides from an auditing point of view.

Massimo

Massimo Di Pierro

Dec 29, 2012, 12:30:19 PM12/29/12
to web...@googlegroups.com
TL;DR: if you can do it with other frameworks, you can do it with web2py, although the default behavior may be different.

There are some important differences between the web2py DAL and SQLAlchemy: 1) SQLAlchemy has better support for non-integer primary keys (the DAL supports them too, but they make automatic form handling very difficult, so we discourage them); 2) SQLAlchemy allows multiple database sessions per thread over the same connection (in web2py, if you want multiple sessions in the same thread, you need multiple connections; I have never seen a use case for this); 3) frameworks that use SQLAlchemy define tables only once, when the program starts, while web2py defines tables at every request. They are both equally reliable and fast.

3) causes a performance penalty for web2py compared with other frameworks when a large number of tables is present. Yet this is not a DAL limitation; it is a consequence of how web2py chooses to use the DAL. We want to allow different HTTP requests to see changes in the schema. This is part of the price for the easy-to-use approach. Using conditional models goes a long way toward reducing this overhead. If you use the DAL from a Python script or from a background process of another framework (for example Tornado), you do not have the overhead at all. Even in web2py there are ways to avoid it completely, although they are not documented yet (define tables at the top level of a module, import db from the module, then call db.reconnect() in the models).
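A minimal sketch of that undocumented pattern (file, table, and field names here are illustrative; treat this as a layout hint under my own assumptions, not tested code):

```python
# modules/db_models.py -- imported once per process, so define_table()
#                         runs once, not at every request:
#
#     from gluon.dal import DAL, Field
#     db = DAL('sqlite://storage.sqlite', migrate=False)
#     db.define_table('person', Field('name'))
#
# models/db.py -- still executed at every request, but now it only
#                 refreshes the connection for the current request:
#
#     from db_models import db
#     db.reconnect()
```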

Let me put it another way. In web2py you can put logic in modules or in models. They are two types of files (one is imported, one is executed). The more logic you put in models, the more agile it is; the more logic you put in modules, the faster it is. Other frameworks do not make this distinction (models are modules and are imported too) and therefore give you no option.

Web2py can definitely be used to build a complex system with many tables. As with any framework, the bottleneck will always be database access, so you need lots of caching. Whether it is the right system to interface with an existing database I cannot say; it depends on the existing database, and you will have issues with any framework you choose.

Massimo



On Saturday, December 29, 2012 11:02:34 AM UTC-6, Alex Glaros wrote:
Hi Lazaro and Massimo,

The Fi$cal project is not my project and I don't know how many tables it has; it is just an example of highly relational financial software, chosen so that web2py members could understand my question clearly. Most projects will be in strict 3rd Normal Form.

Let me be more transparent regarding my goals.

I have a nonprofit government improvement organization and part of our mission is to recommend software architecture to government. I am evaluating web2py as a candidate for prototyping applications. I'm very excited that Massimo's vision to lower the average person's programming entry barrier also works to improve government by allowing rapid prototyping.

The next step is to assess how best to migrate prototypes to production. There may be no further steps necessary; or, for large projects like Fi$cal, there may be a consensus choice, such as moving the entire project to Python/SQLAlchemy, which provides the most stability while allowing some web2py components to be copied for reuse.

It is a significant responsibility to make these recommendations and I need many diverse advisors to select products that reduce risk in government software development.

  1. At this time is there a consensus regarding the top candidates for open source business-oriented, highly relational software tools?
  2. What would the top 10 look like?
  3. What are the reasons that each one was selected?
  4. At what points or areas would other products surpass web2py?

Any comments would be much appreciated,

Alex

Massimo Di Pierro

Dec 29, 2012, 2:07:42 PM12/29/12
to web...@googlegroups.com
To better illustrate my previous point I built an example which may be useful to others who are concerned about performance.


It contains two apps, a1000_1 and a1000_2.

Both apps define 1000 tables, each with 10 columns (no attributes, validators, widgets, no auth). They have one controller which exposes a grid: http://..../a1000_1/default/index/t00001 gives you the grid for table t00001.

a1000_1 uses normal web2py models (defined in db.py and db_tables.py) but is bytecode compiled, uses lazy_tables, and disables migrations. On my slow laptop it serves one page in 0.9 seconds. That means it takes less than 0.9 ms to define each lazy table.

a1000_2 defines all tables in modules/db_modules.py and imports db in models/db.py. Now it takes 80 ms/request, and more than 50% of that is due to the complex template and the cookie/session/language logic. Basically, there is no overhead in creating the 1000 tables.

While web2py defaults to the a1000_1 approach, other frameworks default to the a1000_2 approach, but web2py can do that too!
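The difference between the two defaults can be caricatured in plain Python (a toy model of the cost structure only, not actual web2py code):

```python
def define_tables():
    # stand-in for 1000 define_table() calls with 10 fields each
    return {"t%05d" % i: ["f%03d" % j for j in range(10)]
            for i in range(1000)}

def per_request_handler():
    db = define_tables()        # a1000_1 style: definitions are
    return len(db)              # rebuilt on every single request

SHARED_DB = define_tables()     # a1000_2 style: built once at import

def shared_handler():
    return len(SHARED_DB)       # shared by all requests, so no
                                # per-request definition cost
```

Both handlers see the same 1000 tables; they differ only in when the definition cost is paid.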

What are the drawbacks of the second approach? There are two, and they are not web2py specific.

1) The web server occasionally creates new processes, which will need to reload the module. This takes time, so some users will experience a slowdown.
2) Now the db object is shared by all requests (not the connection; those are still separate db sessions/transactions). This means you cannot alter the attributes of models (e.g. db.t00001.f001.readable=False), because doing so would affect all concurrent and subsequent requests.

Web2py makes you pay a price in order to have a new, clean db object at every request. You pay this price so that you do not have to worry about changing its state and affecting other requests. This costs ~1 ms/table on my laptop.

With conditional models this is a non-issue, since you only pay the price for the tables you actually use in the action.
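For reference, conditional models are just model files placed in a subfolder named after a controller, so they run only for that controller's requests (folder and file names below are illustrative):

```python
# models/db.py             -- executed for every request
# models/grid/tables.py    -- conditional model: executed only for
#                             requests routed to the 'grid' controller,
#                             so only those actions pay define_table()
```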

Hope I am making some sense.

Massimo

Michele Comitini

Dec 29, 2012, 4:30:10 PM12/29/12
to web...@googlegroups.com
This thread reminds me of a feature I would like to see in the DAL.

What the DAL is missing is an introspection feature that SQLAlchemy has: legacy db introspection. With the DAL, the developer always needs to write the metadata describing the db schema by hand.

I faced this with a fairly complex sqlite schema. I hate to write things that are already there ;-) ... so I *had* to make a simple script to translate the DDL of a sqlite schema into the proper DAL data definition instructions; it can be found under the scripts directory in recent web2py versions.
The DAL should have introspection, and eventually build in-memory cached table definitions on the fly. Optionally, the developer could decide to make those table definitions persistent, to reduce start-up times and allow customizations.
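A rough sketch of what such introspection could look like for sqlite, reading the catalog and emitting DAL definitions (the type mapping and output format are simplified assumptions of mine, not Michele's actual script):

```python
import sqlite3

# naive mapping from sqlite column types to DAL field types
TYPE_MAP = {"INTEGER": "integer", "TEXT": "string",
            "REAL": "double", "BLOB": "blob"}

def sqlite_to_dal(conn):
    """Emit one db.define_table() line per table in the schema."""
    out = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        fields = []
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt, pk)
        for _cid, name, ctype, *_rest in conn.execute(
                "PRAGMA table_info(%s)" % t):
            if name == "id":
                continue              # the DAL adds 'id' by itself
            dal_type = TYPE_MAP.get(ctype.upper(), "string")
            fields.append("Field('%s', '%s')" % (name, dal_type))
        out.append("db.define_table('%s', %s)" % (t, ", ".join(fields)))
    return "\n".join(out)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, "
             "name TEXT, age INTEGER)")
print(sqlite_to_dal(conn))
# -> db.define_table('person', Field('name', 'string'), Field('age', 'integer'))
```

As Massimo notes below, the guess is only as good as what the database knows, which is the limitation of any such script.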

mic



Massimo Di Pierro

Dec 29, 2012, 5:30:08 PM12/29/12
to web...@googlegroups.com
The problem is that SQLAlchemy is, from this perspective, lower level. You can use scripts/extract_pgsql_models.py to introspect the database and "guess" the model, but while this information is sufficient for SQLAlchemy, it is not sufficient for web2py. Web2py needs to know more about a column than is known to the database. For example, a varchar field could be a hashed password, or a serialized list of references, etc. Introspection does not help except in trivial cases.

Massimo

Richard Vézina

Feb 14, 2013, 5:01:15 PM2/14/13
to web2py-users
Really interesting thread... Thanks, Massimo, for the clear explanation of web2py's pros and cons... I also like the lookup-table caching trick...

Richard


