Kind'a TL, but please DR - Need your thoughts


Mario R. Osorio

Jan 31, 2016, 11:51:08 PM
to Django users

I need comments on an application that was recently proposed to me. The way it is being envisioned at this moment is this:


One Python daemon will be listening on different communication media such as email, web and SMS (also web-based). IMHO, it is necessary to have one daemon per medium. These daemons will only make sure the messages are received from a validated source and put such messages in a DB.


A second(?) Python daemon would be waiting for those messages to appear in the DB, process them, act according to the objective of the application, and update the DB as expected. These processes might include complicated and numerous mathematical calculations, which might take seconds or even minutes to run.


A third(?) Python daemon would be in charge of replying to the original message with the obtained results, but there might be other media channels involved, e.g. the message was received from a given email or SMS user, but the results have to be sent to multiple other email/SMS users.


The reason I want to build the application with Django is that all of this HAS to have multiple web interfaces and, at the end of the day, most media will come through the web and have to be processed as HTTP requests. Also, Django gives me a framework that keeps this work better organized and clean, and I can make the application(s) DB agnostic.


Wanting the application to be DB agnostic does not mean that I don't have a choice: I know I have many options to communicate among different Python processes, but I prefer to leave that to the DBMS. Of the open source DBMSs I know of, only Firebird and PostgreSQL have events that can provide the communication between all the processes involved. I was able to create a very similar application in 2012 with Firebird, but this time I am restricted to PostgreSQL, which I don't oppose at all. That application did not involve HTTP requests.


My biggest concern at this point is this:

If most (if not all) requests to the application are going to be processed as http requests, what will happen to pending requests when one of them takes too long to reply? Is this something to be solved at the application level or at the server level?



This is as simple as I can put it. Any thoughts, comments, criticism or recommendations are welcome.


Thanks a lot in advance!

Avraham Serour

Feb 1, 2016, 1:30:34 AM
to django-users
If a process takes too long to complete it won't be able to process new requests, so you will be limited to the number of workers you told uWSGI to use.

HTTP requests should be short-lived. If you have some heavy processing to do, the HTTP request should return with something like 'accepted' and send the job to a queue (you can use Celery).
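
For illustration, a minimal sketch of that pattern, with the model and task names being assumptions rather than anything from this thread: the view stores the incoming message, queues the heavy work with Celery, and returns an 'accepted' response immediately.

    # Sketch only -- IncomingMessage and process_message are placeholder names.
    from django.http import JsonResponse

    from .models import IncomingMessage   # assumed model holding the raw request
    from .tasks import process_message    # assumed Celery @shared_task

    def submit(request):
        msg = IncomingMessage.objects.create(body=request.POST.get("body", ""))
        process_message.delay(msg.pk)      # queue the job and return at once
        return JsonResponse({"status": "accepted", "id": msg.pk}, status=202)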


Bill Freeman

Feb 1, 2016, 11:04:33 AM
to django-users
I suggest that you use Celery.

If people are making HTTP requests of you, that is reason enough to choose Django.

But do not wait for long calculations to complete before returning an HTTP result.  Instead redirect to a page containing simple JavaScript that will poll for a result.
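
The server side of that polling loop can be a very small view that the JavaScript calls every few seconds; a rough sketch, assuming a placeholder IncomingMessage model with a nullable result field:

    # Sketch only -- IncomingMessage and its result field are placeholder names.
    from django.http import JsonResponse
    from django.shortcuts import get_object_or_404

    from .models import IncomingMessage

    def check_result(request, pk):
        msg = get_object_or_404(IncomingMessage, pk=pk)
        if msg.result is None:             # calculation still running
            return JsonResponse({"ready": False})
        return JsonResponse({"ready": True, "result": msg.result})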

PostgreSQL is my favorite SQL DB, and I think that you are in good shape there.  Another popular free DB is MariaDB, but I prefer PostgreSQL.

Record the request data in the DB and invoke a Celery task, passing the primary key of the new entry to a worker process. This simply queues a message in RabbitMQ, so it is quite fast. (Other MQs are possible, but RabbitMQ is the best tested with Celery.) This lets you return the HTTP response very quickly.

Any additional daemons needed to poll or listen for requests arriving by means other than HTTP can also store to the DB and call a Celery task.

Rather than polling the DB for work, the Celery worker system, when a worker process is ready, takes a message from the queue (RabbitMQ) and assigns it to that worker. Multiple workers can handle separate messages in parallel, but do size the worker pool according to the size of your machine. The worker fetches the request from the DB and can run for as long as necessary to perform the calculation, even, if necessary, reaching out to other resources over the net. When it is done, it stores the response in the DB for the requester's polling JavaScript to find, and/or sends copies via other mechanisms, such as an SMS interface. Then the worker becomes available to handle another message.
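
A hedged sketch of such a worker task, where run_calculations() stands in for whatever the real math turns out to be:

    # tasks.py -- illustrative only; run_calculations() is a stand-in helper.
    from celery import shared_task

    from .models import IncomingMessage

    @shared_task
    def process_message(pk):
        msg = IncomingMessage.objects.get(pk=pk)
        msg.result = run_calculations(msg.body)   # may run for seconds or minutes
        msg.save(update_fields=["result"])
        # fan the result out to other channels (email, SMS, ...) here if needed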

I can assure you that this works well on Linux (you don't mention the platform).  I have not used Celery (or Django, for that matter) on Windows or Mac, but I'll bet that it runs fine, modulo the usual surprises about file system differences and the way that Windows processes are "special".

Pretty much you just code in Python. The exceptions are the startup scripts to start/manage the Celery workers at boot time, the Apache/nginx front end for Django, and any additional required communications processes. I guess there is also that small piece of JavaScript to poll for a result.

An alternative to the JavaScript is a button for the user to push to see if the results are ready, and you should probably implement that anyway (and use JavaScript to hide it) for those with JavaScript disabled.

James Schneider

Feb 1, 2016, 5:58:05 PM
to django...@googlegroups.com
On Sun, Jan 31, 2016 at 8:51 PM, Mario R. Osorio <nimbi...@gmail.com> wrote:

I need comments on an application that was recently proposed to me. The way it is being envisioned at this moment is this:


One Python daemon will be listening on different communication media such as email, web and SMS (also web-based). IMHO, it is necessary to have one daemon per medium. These daemons will only make sure the messages are received from a validated source and put such messages in a DB.



So this is effectively a feed aggregation engine. I would recommend having a separate daemon running per media source, so that issues with one media source do not affect the operations of another. It would be possible to do everything with one daemon, but would be much trickier to implement.
 

A second(?) Python daemon would be waiting for those messages to appear in the DB, process them, act according to the objective of the application, and update the DB as expected. These processes might include complicated and numerous mathematical calculations, which might take seconds or even minutes to run.


Implementation here is less critical than your workflow design. This could be implemented as a simple cron script on the host that runs every few minutes. The trick is to determine whether or not a) records have already been processed, b) certain records are currently processing, c) records are available that have yet to be processed/examined. You can use extra DB columns with the data to flag whether or not a process has already started examining that row, so any subsequent calls to look for new data can ignore those rows, even if the data hasn't finished processing. 
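
One possible shape for that flag column, sketched with invented model and field names: the poller claims a row inside a transaction so that two overlapping runs never grab the same record.

    # Sketch only -- FeedItem and its fields are placeholder names.
    from django.db import models, transaction

    class FeedItem(models.Model):
        NEW, PROCESSING, DONE = "new", "processing", "done"
        body = models.TextField()
        status = models.CharField(max_length=12, default=NEW)

    def claim_next_item():
        """Atomically mark one unprocessed row as 'processing' and return it."""
        with transaction.atomic():
            item = (FeedItem.objects.select_for_update()
                    .filter(status=FeedItem.NEW).first())
            if item is not None:
                item.status = FeedItem.PROCESSING
                item.save(update_fields=["status"])
            return item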

If you want something with a bit more control than cron, I would recommend a Celery instance that can be controlled/inspected by a Django installation.


A third(?) python daemon would be in charge of replying to the original message with the obtained results, but there might be other media channels involved, eg the message was received from a given email or SMS user, but the results have to be sent to multiple other email/SMS users.


This is the same as the previous question, just with a different task. 
 


The reason I want to build the application with Django is that all of this HAS to have multiple web interfaces and, at the end of the day, most media will come through the web and have to be processed as HTTP requests. Also, Django gives me a framework that keeps this work better organized and clean, and I can make the application(s) DB agnostic.


What do you mean by 'multiple web interfaces'? You mean multiple daemons running on different listening ports? Different sites using the sites framework? End-user browser vs. API? 

If most of your data arrives via HTTP calls, then a single HTTP instance should handle that fine. Your views can shovel data wherever it needs to go.
 
As far as being DB agnostic, I'm assuming that you mean the feed sources don't need to know the DB backend you are using? Django isn't really doing anything special that any other framework can't. 


Wanting the application to be DB agnostic does not mean that I don't have a choice: I know I have many options to communicate among different Python processes, but I prefer to leave that to the DBMS. Of the open source DBMSs I know of, only Firebird and PostgreSQL have events that can provide the communication between all the processes involved. I was able to create a very similar application in 2012 with Firebird, but this time I am restricted to PostgreSQL, which I don't oppose at all. That application did not involve HTTP requests.


Prefer to leave what to the DBMS? The DBMS is responsible for storing and indexing data, not process management. Some DBMS' may have some tricks to perform such tasks, but I wouldn't necessarily want to rely on them unless really necessary. If you're going to the trouble of writing separate listening daemons, then they can talk to whatever backend you choose with the right drivers.

Choose the database based on feature set, compatibility with your host systems, and perhaps even benchmarks based on the type of data you may be storing (short life, high-read rate vs. long term low read volume single table data vs. sporadic read/write, etc.). Some databases handle certain situations better than others (e.g. if you are using UUIDs rather than integers for primary keys, Postgres would likely be better than MySQL since it has a native UUID type that it can index efficiently).
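
For instance, a UUID primary key in Django is one line on the model; on PostgreSQL it is stored in a native uuid column, while most other backends fall back to a char column (the model name here is just a placeholder):

    import uuid
    from django.db import models

    class Request(models.Model):
        # native uuid type on PostgreSQL, char(32) on most other backends
        id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
        payload = models.TextField()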
 


My biggest concern at this point is this:

If most (if not all) requests to the application are going to be processed as http requests, what will happen to pending requests when one of them takes too long to reply? Is this something to be solved at the application level or at the server level?


Hence my comments about workflow. You'll need to decide the proper timers and what happens to that data. If you mean the operating system level when you mention 'server level', the OS will only manage the raw connection details themselves (such as the TCP timeouts, etc.). The data that is being processed is irrelevant to the OS. Your application needs to make the determination about the behavior when a connection times out, or a source takes too long to provide data (perhaps the host is keeping the TCP connection alive, but is not sending any data). That's up to you to decide. Your application should be aware of all of those scenarios and should act according to your workflow design, which would include timers and default behavior for all actions within the entire application. 

You'll need to design the entire application such that you do not have data inconsistencies, and what to do when those inconsistencies are encountered. The contents of the data will drive the requirements. Can you throw away an update from a feed if you only get one out of two pages that you were expecting? Or do you keep the single page and somehow factor that in to your other calculations? Will that ruin search results? 


This is as simple as I can put it. Any thoughts, comments, criticism or recommendations are welcome.



It's difficult to put simply because you are not necessarily describing a simple system. I'm sure it will become even more complex as you get further through the design process once you've gathered requirements. Requirements and workflow design will drive your implementation to meet the goals. For example, you would install Celery because you have a requirement to run the task 4 times a day normally, but would also like to trigger an on-demand task at any time. With cron, you can't (easily) do that.

Figure out 'what' you want to do first, then figure out 'how'. It's easy to jump straight into implementation, but with a system as complex as the one you're describing, you should spend a bit of time with a flow chart tool and come up with use cases before anyone hits a keyboard.

-James

Mario R. Osorio

Feb 1, 2016, 11:49:20 PM
to Django users
Thanks to each and every one of you for the VERY HELPFUL recommendations you have given me (Avraham Serour, ke1g & James Schneider so far). I really got back more than I expected, but as I told a friend of mine, it was worth every minute of the almost 4 hours it took me to write this question. Not without errors, though, because when I wrote "This is as simple as I can put it", I really should have said "This is as SUCCINCT as I can put it". I just wanted to avoid TL;DNR responses.

More than giving me ideas, you guys basically gave me everything but the code.

MUCHAS GRACIAS again!

I will now touch on each of the thoughts each one of you shared.

Mario R. Osorio

Feb 2, 2016, 12:02:48 AM
to Django users
I understand HTTP requests are short-lived, I just could not figure out how to handle the possibly long responses. I'm rather new to Python AND to web programming, and when I start mixing in things like WSGI my brain just bursts. I know about Celery, I just never thought it would be my ally in this endeavor, though I did consider Twisted.

OTOH, not being a web programmer, it takes me a while to identify what I can do in the browser. You gave me a DUH! moment and I thank you for that.

Mario R. Osorio

Feb 2, 2016, 12:38:13 AM
to Django users
I suggest that you use Celery.

A very useful tip! 

If people are making HTTP requests of you, that is reason enough to choose Django.

Well, let me put it this way: I fell in love with Python almost 7 years ago, but when I met Django, this love turned into an obsession. You see, I had been programming database applications for over 35 years, never paying attention to a "strange" programming language "named after a snake" :D But not only that; it "comes" with LOTS of web frameworks that make sense to me, and treats DBMSs with respect ... a match made in heaven!!

Just for the record, I have programmed in many languages over the years and never felt like this about any of them. Any PHP framework is out of the question for me; I can read and understand the code, but just the whole concept of PHP makes me want to puke (literally ... seriously): spaghetti code, dozens of functions that do almost the same thing, or dozens of names for the same function ... No thanks, I pass!

But do not wait for long calculations to complete before returning an HTTP result.  Instead redirect to a page containing simple JavaScript that will poll for a result.

As I said before, a DUH! moment ... or is it a senior moment ... I'm starting to get worried here.

I can assure you that this works well on Linux (you don't mention the platform).  I have not used Celery (or Django, for that matter) on Windows or Mac, but I'll bet that it runs fine, modulo the usual surprises about file system differences and the way that Windows processes are "special".
I'm indeed using Linux; I'm not exactly a big fan of Windoze and never have been, though I started to feel more respect towards it with the 8.1 version. It really is a more robust OS ... I come from the dark ages of CP/M->UNIX->DOS->WINDOWS and then I met Linux.
 

Pretty much you just code in Python. The exceptions are the startup scripts to start/manage the Celery workers at boot time, the Apache/nginx front end for Django, and any additional required communications processes. I guess there is also that small piece of JavaScript to poll for a result.

I don't mean to start yet another heated discussion on web server preferences, and trust me, I've done my research comparing Apache vs. NGINX, and I still feel this is one of the toughest decisions we will have to make ... I kind'a like NGINX, but it's like the words "heavy duty" and "robust" can only be mentioned when you talk about Apache.

Many, many thanks for your input!

Mario R. Osorio

Feb 2, 2016, 1:49:34 AM
to Django users
So this is effectively a feed aggregation engine. I would recommend having a separate daemon running per media source, so that issues with one media source do not affect the operations of another.

 I never would have thought of this application as a feed aggregation engine, but I'm not really sure it fits the definition, will be digging deeper into this
 

It would be possible to do everything with one daemon, but would be much trickier to implement.

I agree 120%

 

A second(?) Python daemon would be waiting for those messages to appear in the DB, process them, act according to the objective of the application, and update the DB as expected. These processes might include complicated and numerous mathematical calculations, which might take seconds or even minutes to run.


Implementation here is less critical than your workflow design.

I agree, yet this is the heart of my application. I understand it basically only involves the (web) application and the DBMS w/o any other external elements; it is here where the whole shebang happens. But that might just be the DB application programmer in me.

This could be implemented as a simple cron script on the host that runs every few minutes. The trick is to determine whether or not a) records have already been processed, b) certain records are currently processing, c) records are available that have yet to be processed/examined. You can use extra DB columns with the data to flag whether or not a process has already started examining that row, so any subsequent calls to look for new data can ignore those rows, even if the data hasn't finished processing. 

You gave me half my code there, but I'm not sure I want to trust a cron job for that. I know there are plenty of other options to do the dirty laundry here, such as queues, signals, sub-processes (and others?), but I kind'a feel comfortable leaving that communication exchange to the DBMS events. As I see it, who would know better when 'something' happened than the DBMS itself?
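
For what it's worth, the PostgreSQL counterpart of those Firebird events is LISTEN/NOTIFY. A rough sketch of a listening daemon using psycopg2, with the connection string and channel name as placeholders (the NOTIFY would typically come from an insert trigger calling pg_notify()):

    # Sketch only -- connection string and channel name are placeholders.
    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=myapp user=myapp")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    cur = conn.cursor()
    cur.execute("LISTEN new_message;")    # fired by a trigger on the messages table

    while True:
        # block until the DBMS signals that something happened (60 s timeout)
        if select.select([conn], [], [], 60) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            print("work to do, payload:", notify.payload)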

The reason I want to build the application with Django is that all of this HAS to have multiple web interfaces and, at the end of the day, most media will come through the web and have to be processed as HTTP requests. Also, Django gives me a framework that keeps this work better organized and clean, and I can make the application(s) DB agnostic.


 
What do you mean by 'multiple web interfaces'? You mean multiple daemons running on different listening ports? Different sites using the sites framework? End-user browser vs. API? 

A combination of all that and probably a bit more ... This is something I left out trying to evade the TL;DNR responses: I'm considering having this app return nothing but JSON or XML for other applications to "feed" from (there's that feed word again!); there are a myriad of possible ways this application can be used. This, BTW, would leave all the HTML/CSS/JavaScript/etc. "problems" to someone else ... it might just be the DB app programmer in me trying to avoid dealing with web issues, or I might just be trying to make things harder for myself; this is something I haven't really thought much about.


Wanting the application to be DB agnostic does not mean that I don't have a choice: I know I have many options to communicate among different Python processes, but I prefer to leave that to the DBMS. Of the open source DBMSs I know of, only Firebird and PostgreSQL have events that can provide the communication between all the processes involved. I was able to create a very similar application in 2012 with Firebird, but this time I am restricted to PostgreSQL, which I don't oppose at all. That application did not involve HTTP requests.


Prefer to leave what to the DBMS? The DBMS is responsible for storing and indexing data, not process management. Some DBMS' may have some tricks to perform such tasks, but I wouldn't necessarily want to rely on them unless really necessary. If you're going to the trouble of writing separate listening daemons, then they can talk to whatever backend you choose with the right drivers.

I understand I'm having the DBMS do some of the process management, but it only goes as far as letting other processes know there is some job to be done, not even what needs to be done. I don't think the overhead on the DBMS is going to be all that big.

This whole application is an idea that's been in my mind for some 7 years now. I even got as far as having a working prototype. I was just starting to learn Python then and my code is a shameful, non-Pythonic mess. But it worked. I used Firebird as my RDBMS, and all feeds (again?) would come in and out through an ad-hoc Gmail account (with Google Voice for SMS messaging). I would get the input, process it and return the output within 10 to 40 seconds, with the average at around 20, which is satisfying if you consider the app is not really controlling the "medium". Of course, I never even considered any heavy testing as there were many limitations, the 500 outgoing messages per day being just the first one. It just proved my concept and served as a very good (and long) exercise in Python.

I recently shared my thoughts with some close friends that linger around other branches of (IT related) knowledge and they liked the idea, hence the request for your input, for which I feel very much obliged.

Thanks a BUNCH!

============
DISCLAIMER!
============
I do not mean to argue with any of the ideas you and all the others have shared with me; on the contrary, you have fed my curiosity even more, and curiosity well managed usually turns into knowledge. I can't do anything but thank all of you for that gift.

James Schneider

Feb 2, 2016, 3:48:48 AM
to django...@googlegroups.com
On Mon, Feb 1, 2016 at 10:49 PM, Mario R. Osorio <nimbi...@gmail.com> wrote:
So this is effectively a feed aggregation engine. I would recommend having a separate daemon running per media source, so that issues with one media source do not affect the operations of another.

 I never would have thought of this application as a feed aggregation engine, but I'm not really sure it fits the definition, will be digging deeper into this

Maybe not in the traditional sense of pulling in RSS, Twitter, Facebook, etc., but it sounds like you want to perform the same task with other message types like email and an SMS gateway. There may be applications that exist already in the SaaS space if your SMS gateways are publicly accessible and support remote integration options (I assume they do since you mentioned it as one of your sources). There's probably something out there for ripping out email content and dropping it in a database as well, or serializing that data into a format that can be easily parsed and dropped into the database by a small integration script (perhaps JSON to Postgres?). I'm not familiar with any, but I'd be shocked if they don't exist. You can look at that as an option to rolling your own solution, which may end up cheaper and easier in the long run since you wouldn't be responsible for maintaining that portion of the system. The folks who run those systems/applications are more familiar with the edge cases and probably already have them handled. Worth a Google anyway to possibly avoid re-inventing an already perfectly round wheel.
 
 

It would be possible to do everything with one daemon, but would be much trickier to implement.

I agree 120%

 

A second(?) Python daemon would be waiting for those messages to appear in the DB, process them, act according to the objective of the application, and update the DB as expected. These processes might include complicated and numerous mathematical calculations, which might take seconds or even minutes to run.


Implementation here is less critical than your workflow design.

I agree, yet this is the heart of my application. I understand it basically only involves the (web) application and the DBMS w/o any other external elements; it is here where the whole shebang happens. But that might just be the DB application programmer in me.

Ah, now I see the bias. ;-) I'm a network administrator by trade, so I can fix any problem with the right router. I totally get it. :-D
 
I mentioned workflow for exactly the reason you pointed out: this is the heart of the application, and if it's wrong, the rest of the system fails.


This could be implemented as a simple cron script on the host that runs every few minutes. The trick is to determine whether or not a) records have already been processed, b) certain records are currently processing, c) records are available that have yet to be processed/examined. You can use extra DB columns with the data to flag whether or not a process has already started examining that row, so any subsequent calls to look for new data can ignore those rows, even if the data hasn't finished processing. 

You gave me half my code there, but I'm not sure I want to trust a cron job for that. I know there are plenty of other options to do the dirty laundry here, such as queues, signals, sub-processes (and others?), but I kind'a feel comfortable leaving that communication exchange to the DBMS events. As I see it, who would know better when 'something' happened than the DBMS itself?

For long running processes, you'll want flags that are persistent, especially if there is a failure along the way (process crash, power loss, etc.). I wouldn't necessarily trust a long-running but transient process in RAM to complete (although it may often times be a necessary evil). Short-running processes are usually fine, especially if they can be easily recreated in the event of a processing failure. The DBMS also serves as the central information point for what data has/hasn't been processed, which I suppose counts as process management to some degree.
 

The reason I want to build the application with Django is that all of this HAS to have multiple web interfaces and, at the end of the day, most media will come through the web and have to be processed as HTTP requests. Also, Django gives me a framework that keeps this work better organized and clean, and I can make the application(s) DB agnostic.


 
What do you mean by 'multiple web interfaces'? You mean multiple daemons running on different listening ports? Different sites using the sites framework? End-user browser vs. API? 

A combination of all that and probably a bit more ... This is something I left out trying to evade the TL;DNR responses: I'm considering having this app return nothing but JSON or XML for other applications to "feed" from (there's that feed word again!); there are a myriad of possible ways this application can be used. This, BTW, would leave all the HTML/CSS/JavaScript/etc. "problems" to someone else ... it might just be the DB app programmer in me trying to avoid dealing with web issues, or I might just be trying to make things harder for myself; this is something I haven't really thought much about.

If that's the route you are trying to go, build the API first. The Django REST Framework is an excellent tool. If you want a human-friendly element later, you can slap in your own web front-end and take advantage of the API calls you've already created. I feel you on the JS/HTML/CSS issue, though. I don't know who invented CSS, but it is a completely different mindset from application or database programming, you know, where we expect objective, consistent, and deterministic results. ;-) 
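
A minimal DRF sketch of that API-first idea, with the model and field names again being placeholders rather than anything from this thread:

    # Sketch only -- FeedItem and its fields are placeholder names.
    from rest_framework import routers, serializers, viewsets

    from .models import FeedItem

    class FeedItemSerializer(serializers.ModelSerializer):
        class Meta:
            model = FeedItem
            fields = ("id", "body", "status")

    class FeedItemViewSet(viewsets.ModelViewSet):
        queryset = FeedItem.objects.all()
        serializer_class = FeedItemSerializer

    router = routers.DefaultRouter()
    router.register(r"items", FeedItemViewSet)
    # then include router.urls in the project's urlpatterns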
 

Wanting the application to be DB agnostic does not mean that I don't have a choice: I know I have many options to communicate among different Python processes, but I prefer to leave that to the DBMS. Of the open source DBMSs I know of, only Firebird and PostgreSQL have events that can provide the communication between all the processes involved. I was able to create a very similar application in 2012 with Firebird, but this time I am restricted to PostgreSQL, which I don't oppose at all. That application did not involve HTTP requests.


Prefer to leave what to the DBMS? The DBMS is responsible for storing and indexing data, not process management. Some DBMS' may have some tricks to perform such tasks, but I wouldn't necessarily want to rely on them unless really necessary. If you're going to the trouble of writing separate listening daemons, then they can talk to whatever backend you choose with the right drivers.

I understand I'm having the DBMS do some of the process management, but it only goes as far as letting other processes know there is some job to be done, not even what needs to be done. I don't think the overhead on the DBMS is going to be all that big.

I'd assume you are thinking of some kind of DBMS trigger functionality to fire off other related processes. It may totally be appropriate for your application. Most on this list will jump to Celery because it integrates well with Django and is probably easier to introspect with regard to job management. But hey, that may not even matter if you don't need to manage that set of processes from the web. I'm sure the DB driver would have support to interrogate those processes manually, though; you just don't get the benefit of the ORM abstraction in that case, and it would make your application much less DB-agnostic. I don't know enough about such features to have a decent opinion either way.
 

This whole application is an idea that's been in my mind for some 7 years now. I even got as far as having a working prototype. I was just starting to learn Python then and my code is a shameful, non-Pythonic mess. But it worked. I used Firebird as my RDBMS, and all feeds (again?) would come in and out through an ad-hoc Gmail account (with Google Voice for SMS messaging). I would get the input, process it and return the output within 10 to 40 seconds, with the average at around 20, which is satisfying if you consider the app is not really controlling the "medium". Of course, I never even considered any heavy testing as there were many limitations, the 500 outgoing messages per day being just the first one. It just proved my concept and served as a very good (and long) exercise in Python.

That's actually really good. Having a working baseline should make it easy to improve, even if you end up rewriting the whole thing. The non-Pythonic mess makes my eye twitch a bit, but I'm assuming that you'll be cleaning that up along the way. It also has probably enlightened you to issues with the existing workflow.
 

I recently shared my thoughts with some close friends that linger around other branches of (IT related) knowledge and they liked the idea, hence the request for your input, for which I feel very much obliged.

Thanks a BUNCH!
 
============
DISCLAIMER!
============
I do not mean to argue with any of the ideas you and all the others have shared with me; on the contrary, you have fed my curiosity even more, and curiosity well managed usually turns into knowledge. I can't do anything but thank all of you for that gift.

I don't think you were being argumentative by any means. Start with what you know, play to your strengths, and work outward from there. My only other advice, keep it simple as long as you can. Occam's Razor proves true for most of the programming (and life) problems I've faced (the simplest solution is often the correct one).

-James
