Help defining architecture for app


Francisco Betancourt

unread,
Apr 2, 2014, 7:33:59 PM4/2/14
to web...@googlegroups.com
Hello everyone.

I need some help defining the architecture for an application. We already have something in place, but the size of the users' data files is making the site very unresponsive. The app works as follows:

  1. The user uploads a csv file
  2. The file is read into a table for further processing
  3. The user previews the data read from the file (to check that everything is OK)
  4. If the data is OK, the user clicks a button which makes us generate an XML document (using some of the uploaded data) and send it to a web service (see the sketch after this list)
  5. The web service is not ours, and we must send the XML files one at a time (though we can send hundreds simultaneously)
  6. The web service returns data, and we store that data in the db
  7. Once done, we offer a print-friendly version of the final data
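
For illustration only, steps 4 and 5 might look roughly like the sketch below. The field names, the row layout and the web service URL are placeholders, not details from this thread; the real XML follows whatever schema the provider expects.

    # Hypothetical sketch of steps 4-5: build an XML document from one
    # uploaded row and POST it to the third-party web service.
    # Field names and the endpoint URL are placeholders.
    import urllib2
    import xml.etree.ElementTree as ET

    WEBSERVICE_URL = 'https://provider.example.com/endpoint'  # placeholder

    def row_to_xml(row):
        root = ET.Element('record')
        for field in ('invoice', 'amount', 'date'):  # assumed columns
            ET.SubElement(root, field).text = str(row[field])
        return ET.tostring(root)

    def send_row(row):
        payload = row_to_xml(row)
        request = urllib2.Request(WEBSERVICE_URL, payload,
                                  {'Content-Type': 'application/xml'})
        return urllib2.urlopen(request, timeout=30).read()  # reply is stored in the db (step 6)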

So currently we are doing the following:

  1. Receive the file and, once saved, process it with web2py's built-in CSV import.
  2. Once the data has been read, show a view with all the rows and a button to start the process.
  3. Using JS, send groups of 20 rows at a time through ajax to be processed (into XML) and sent to the web service (the controller side looks roughly like the sketch after this list).
  4. Each ajax call returns JS code to update a progress bar.
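
The controller behind those ajax calls presumably amounts to something like the sketch below (table names, element ids and the percentage handling are placeholders; send_row is the hypothetical helper from the earlier sketch):

    # Hypothetical ajax controller: process one batch of row ids, store the
    # web service replies, and return JS (for an ajax(..., ':eval') call)
    # that moves the progress bar.
    def process_chunk():
        session.forget(response)              # avoid session locking during ajax calls
        ids = [int(i) for i in request.vars.ids.split(',')]
        for row in db(db.uploaded_row.id.belongs(ids)).select():
            reply = send_row(row)             # call the external web service
            db.webservice_result.insert(row_id=row.id, response=reply)
        # the client computes the percentage and passes it along
        return "jQuery('#progress').css('width', '%s%%');" % request.vars.percent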

Originally these files were supposed to consist of a few hundred rows, hardly ever a thousand, but in the end we get files with 15,000 rows and the average is about 4,000. Surprisingly, the view (even with 15,000 rows and JS) works well. But the site becomes quite unresponsive (especially because there are normally a dozen or so users doing the same thing simultaneously).

We have already done as many optimizations as we know of (use migrate=False for the db, call session.forget in the ajax controller functions, byte-compile the code), but it is still way too slow. So we are making some heavy changes. I want to try using the scheduler and workers to do this job, but I fear it will generate even more db queries and make things worse. So I would like some suggestions on how to tackle this problem.

Is the scheduler the way to go? And if so, should I make a task for every XML/web service call (this seems like a bad idea to me), or should I group them into chunks of an arbitrary size (e.g. 50)? And if I make this change, will I still be able to display the progress of the process?
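
To make the chunking question concrete, one possible shape (a sketch only; table names, the chunk size and the progress mechanism are assumptions) is to queue one scheduler task per chunk and count finished rows for the progress display:

    # Sketch: one scheduler task per chunk of ~50 rows. send_row() is the
    # hypothetical helper from the first sketch; uploaded_row and
    # webservice_result are placeholder tables.
    from gluon.scheduler import Scheduler

    def process_rows(row_ids):
        for row in db(db.uploaded_row.id.belongs(row_ids)).select():
            reply = send_row(row)
            db.webservice_result.insert(row_id=row.id, response=reply)
        db.commit()
        return len(row_ids)

    scheduler = Scheduler(db, tasks=dict(process_rows=process_rows))

    def start_processing():
        # controller: split one upload into chunks and queue one task per chunk
        upload_id = request.args(0, cast=int)
        ids = [r.id for r in db(db.uploaded_row.upload_id == upload_id)
                               .select(db.uploaded_row.id)]
        chunk = 50
        for i in range(0, len(ids), chunk):
            scheduler.queue_task('process_rows',
                                 pvars=dict(row_ids=ids[i:i + chunk]),
                                 group_name='csv_%s' % upload_id,
                                 timeout=600)
        return 'queued %s tasks' % ((len(ids) + chunk - 1) // chunk)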

Thanks for all the help. Good day.

Niphlod

unread,
Apr 3, 2014, 12:17:09 PM4/3/14
to web...@googlegroups.com
I think the main point is to assess whether you have enough resources to do what you want. If your server can't keep up with the work you're asking it to do, then using the scheduler or anything else won't bring any benefits.
If instead you have a not-so-snappy user interface because you don't have enough spare power to respond to users quickly, then the scheduler is a good fit.

If you're seeing all the strain of the system on the db backend, then you're in the same situation as the first one, i.e. you need to "buy more juice".

Francisco Betancourt

unread,
Apr 4, 2014, 10:30:27 AM4/4/14
to web...@googlegroups.com
Thanks for the response, Niphlod. I believe the bottleneck is the db; I am rewriting the code to use fewer queries, but in general terms I think I agree with you: I need a faster server. I currently host on Digital Ocean and I'm running a dual-core server with 2GB of RAM. The statistics for the server show CPU usage never reaches 100% (not even 50%), even when processing the big files, so I guess the problem is I/O. This server already uses SSDs, so I was thinking of adding a second db server to divide the work, so that the information I get back from the web service is stored in another db, and then I could also use that db for my scheduler tasks.

Do you think this makes sense?
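
For what it's worth, a sketch of that split with two DAL connections (hostnames, credentials and the table layout are placeholders; this belongs in a model file, where DAL, Field and the scheduler are available):

    # Sketch: main data stays on the current server; web service results and
    # the scheduler tables go to a second PostgreSQL server.
    from gluon.scheduler import Scheduler

    db = DAL('postgres://app:secret@db1.internal/main',
             pool_size=10, migrate=False)
    db_results = DAL('postgres://app:secret@db2.internal/results',
                     pool_size=10)

    db_results.define_table('webservice_result',
                            Field('row_id', 'integer'),
                            Field('response', 'text'))

    # scheduler_task / scheduler_run live on the second server as well
    scheduler = Scheduler(db_results)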

LightDot

unread,
Apr 4, 2014, 12:50:07 PM4/4/14
to web...@googlegroups.com
It would be prudent to see some numbers and to learn more about your setup...

So, what is your current setup like? Which web server are you using, and which database? How many simultaneous db connections are we talking about?

Digital Ocean uses KVM virtualization, correct? How is your memory usage, and which OS are you using within your VPS? How bad is the I/O really..?

Francisco Betancourt

unread,
Apr 4, 2014, 1:14:39 PM4/4/14
to web...@googlegroups.com
Hello LightDot

Thanks for the interest in my post; I hope we can come up with a way to improve performance. This is my current setup:

  • Host: Digital Ocean (and yes, I believe their Droplets, as the instances are called, are KVM)
  • OS: Ubuntu 13.04
  • Web Server: Apache 2.2
  • Database: PostgreSQL 9.1
  • Memory usage at peak times is about 1.4GB (out of 2GB)
  • By the way, I have no swap partition
  • I don't know how to count db connections; in my db definition I used pool_size=50, but again I don't know how to check the number of connections at any given time (see the snippet after this list)
  • Disk usage according to Digital Ocean's metrics is medium, never high (though truthfully I have never understood their graph)
  • CPU usage at some points gets close to 50% (since this is a dual core, I would assume one of the cores is at 100%)

I don't know what else to mention, but if I missed anything please ask. And thanks again.
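
As an aside, the number of open connections can be read straight from PostgreSQL's pg_stat_activity view, for example from a web2py shell or a controller:

    # Count current PostgreSQL connections (one row per backend on 9.1).
    rows = db.executesql("SELECT count(*) FROM pg_stat_activity;")
    connections = rows[0][0]

    # Or broken down per database:
    per_db = db.executesql(
        "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;")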

Leonel Câmara

unread,
Apr 4, 2014, 2:34:27 PM4/4/14
to web...@googlegroups.com
May I ask why you are storing the csv file rows in the database if what you want is to send them to the webservice for processing?

If there are only a dozen or so users at a time, this could easily fit in memory. I would store everything in cache.ram with the session id as key and delete it after processing or after a certain time has elapsed. This should speed up your processing by several orders of magnitude.
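
A minimal sketch of that idea, assuming web2py's cache.ram and a key derived from the session id (names are illustrative):

    # Sketch: keep the parsed csv rows in RAM instead of the db, keyed by the
    # session id, and drop them when processing finishes or after an hour.
    def store_rows(parsed_rows):
        # time_expire=0 forces the value to be (re)stored
        cache.ram('csv_rows_%s' % response.session_id,
                  lambda: parsed_rows, time_expire=0)

    def get_rows():
        # returns the rows if stored less than an hour ago, else None
        return cache.ram('csv_rows_%s' % response.session_id,
                         lambda: None, time_expire=3600)

    def clear_rows():
        # passing None as the callable removes the entry
        cache.ram('csv_rows_%s' % response.session_id, None)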

That said, sending 20 rows at a time to the webservice when the files have many thousands of rows will always be extremely slow. At the very least, you pay for latency, TCP connection establishment, HTTP connection establishment, etc. per call. All that adds up to a huge amount of time.

I would say that this webservice is not adequate for your objectives if this being slow is a problem. You would need a webservice that lets you send many more rows at a time.

I am assuming that the webservice is not part of your application, and that your application does not sit between the ajax calls in the client and the webservice. So there's not much optimization for you to do there.

Francisco Betancourt

unread,
Apr 4, 2014, 2:46:33 PM4/4/14
to web...@googlegroups.com
Didn't know you could do that. But sadly users preview the uploaded data before the process runs, and the information has to be stored for archival purposes for a good amount of time (say years), so I do need the db.

I already talked to the web service provider and told them just that, and that they should add a method to send multiple rows at a time (or all of them) to avoid the per-row connection overhead. They told me they would evaluate it, but they don't seem very convinced; they also told me I should increase the number of simultaneous requests from 20 to, say, 100, and that their service should be able to handle it. I was going to try that out, and that's where I wanted to use the scheduler to queue my tasks and run a good number of workers.

Leonel Câmara

unread,
Apr 4, 2014, 3:40:25 PM4/4/14
to web...@googlegroups.com
> Didn't know you could do that. But sadly users preview the uploaded data before the process runs, and the information has to be stored for archival purposes for a good amount of time (say years), so I do need the db.

Well, unless they need to do something with this data besides archiving it somehow, you can just save the csv file they have uploaded.

Whoever is providing the webservice is retarded :)

Is your server the one calling the webservice, or the client? If it's the client, there's not much point in even considering the scheduler; instead you might consider using web workers. If it is the server, you definitely should use the scheduler; you should also look and see if it's possible to cache the webservice responses.
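
If identical rows can recur, one cheap way to cache responses (a sketch under assumptions: cache.ram is per-process, the key is a hash of the payload, and row_to_xml/WEBSERVICE_URL are the hypothetical helpers from the earlier sketches):

    # Sketch: cache each web service reply keyed by a hash of the XML payload,
    # so identical rows are only sent once per process per day.
    import hashlib
    import urllib2

    def send_row_cached(row):
        payload = row_to_xml(row)                  # hypothetical helper
        key = 'ws_%s' % hashlib.sha1(payload).hexdigest()
        call = lambda: urllib2.urlopen(
            urllib2.Request(WEBSERVICE_URL, payload,
                            {'Content-Type': 'application/xml'}),
            timeout=30).read()
        return cache.ram(key, call, time_expire=24 * 3600)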

Francisco Betancourt

unread,
Apr 4, 2014, 3:58:54 PM4/4/14
to web...@googlegroups.com
I must call it from my server: the data from the csv is processed, and that data is sent to the web service. Also, I was hoping that since it's our server doing the work, users can close their browsers and come back hours later to see the progress, rather than needing to keep their browsers open for so long. That is another point in favor of the scheduler.
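
With the work queued server-side, progress can be exposed by a small controller that the page (or a returning user) polls; a sketch, reusing the placeholder tables from the earlier sketches:

    # Sketch: report how many rows of one upload have been processed so far.
    def progress():
        upload_id = request.args(0, cast=int)
        total = db(db.uploaded_row.upload_id == upload_id).count()
        done = db((db.webservice_result.row_id == db.uploaded_row.id) &
                  (db.uploaded_row.upload_id == upload_id)).count()
        return response.json(dict(total=total, done=done))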

Derek

unread,
Apr 4, 2014, 4:10:47 PM4/4/14
to web...@googlegroups.com
I think for your use case, you may want to consider using Tactic instead of Web2py.

Francisco Betancourt

unread,
Apr 4, 2014, 4:50:53 PM4/4/14
to web...@googlegroups.com
Never heard of Tactic. I will check it out. But from what I see it runs on Windows and MacOS only.

黄祥

unread,
Apr 4, 2014, 5:55:39 PM4/4/14
to web...@googlegroups.com

Derek

unread,
Apr 4, 2014, 6:44:58 PM4/4/14
to web...@googlegroups.com
Yes, that's it.

LightDot

unread,
Apr 4, 2014, 11:14:32 PM4/4/14
to web...@googlegroups.com
On Friday, April 4, 2014 7:14:39 PM UTC+2, Francisco Betancourt wrote:
> Hello LightDot
>
> Thanks for the interest in my post; I hope we can come up with a way to improve performance. This is my current setup:
>
>   • Host: Digital Ocean (and yes, I believe their Droplets, as the instances are called, are KVM)
>   • OS: Ubuntu 13.04
>   • Web Server: Apache 2.2
>   • Database: PostgreSQL 9.1
>   • Memory usage at peak times is about 1.4GB (out of 2GB)
>   • By the way, I have no swap partition
>   • I don't know how to count db connections; in my db definition I used pool_size=50, but again I don't know how to check the number of connections at any given time
>   • Disk usage according to Digital Ocean's metrics is medium, never high (though truthfully I have never understood their graph)
>   • CPU usage at some points gets close to 50% (since this is a dual core, I would assume one of the cores is at 100%)
>
> I don't know what else to mention, but if I missed anything please ask. And thanks again.


This is good information to start with. Do you have more statistics, such as top and iostat output at peak times, etc.?

PostgreSQL itself has statistics available; for example, see http://www.postgresql.org/docs/9.1/static/monitoring-stats.html

Did you already tune Apache or PostgreSQL in any way? If so, what are your settings..? On the web2py side and for your current use case, the pool_size=50 setting might be OK or not, depending on your Apache/PostgreSQL setup...
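
For reference, pool_size is set on the DAL connection and applies per process, so with several Apache/WSGI processes the total can exceed PostgreSQL's max_connections (100 by default). The numbers below are illustrative, not recommendations:

    # Illustrative connection setup; credentials and values are placeholders.
    db = DAL('postgres://app:secret@localhost/main',
             pool_size=10, migrate=False)

    # PostgreSQL's own ceiling can be checked with:
    db.executesql("SHOW max_connections;")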

Francisco Betancourt

unread,
Apr 5, 2014, 12:37:59 PM4/5/14
to web...@googlegroups.com
I have not tuned Apache or PostgreSQL; they're both vanilla installs through APT. I don't have the top or iostat output right now; I will simulate the load tomorrow, when users are not working on the app, and post back on Monday. I will also enable the PostgreSQL statistics and post those too. Thanks for the help.

Francisco Betancourt

unread,
Apr 7, 2014, 2:37:02 PM4/7/14
to web...@googlegroups.com
Hi, I finished the testing.

I attach the graphs of my simulated load. They were created using dstat and vmstat. The idle graph shows the system while idle, while the load graphs show the performance hit under the simulated load. I also attach the top output while working under load.

From what I see, the server never gets close to full utilization, yet requests become very slow and I even got an Internal Error at some point (screen capture also attached). My simulated load was three users doing close to 25,000 web service requests and storing the results in the db (though I didn't let it finish). So at some points I must have been doing close to 60 to 100 ajax requests per second, which doesn't seem like too many to me.

Does anyone see anything weird in the data I posted? Or have any suggestions?

Thank you all for your time and help.
idle-graph.png
load.png
load2.png
internal_error.png
top_load.txt.zip

Derek

unread,
Apr 8, 2014, 3:41:20 PM4/8/14
to web...@googlegroups.com
There was a post a while back which gives the recommended settings for Apache. Apache is very slow for web2py and other Python projects; nginx is much better.