scalable notebook

Jason Grout

Jan 20, 2011, 11:48:29 AM
to sage-notebook
Following up on the work done at Sage Days, it seems like we have two
main projects that should be completed:

* Porting the notebook over to use Flask on top of Twisted. What is the
status of this? It seemed like it was pretty much ready when Sage Days
finished last week. Do we just need testers?

* Redesigning the entire communication architecture to be
database-centric. I've consolidated some information, answered some
frequently asked questions, and put up a diagram of the architecture
here: http://wiki.sagemath.org/Notebook%20design

What do people think? I guess one big question is how it compares with
the other main proposal for scalable notebook design:

https://docs.google.com/document/d/1uYJXPAWypGgb92QStJ19cW-29y4-hn5hi8oXMR-11TU/edit?hl=en&authkey=CISp9cQB&pli=1#

Should we decide on one or the other, or should both go forward, or
should there be some merging of ideas? I plan to work on the
database-centric design, as it seems to be:

1. easier to manage with smaller more self-contained tasks
2. competitive in scalability
3. Websockets (which I think is a main component of Alex's design) are
probably not going to make it into Firefox 4.0 due to security concerns
[1], so I think that technology may need to mature just a bit more. I
can easily understand what is needed for the database-centric design and
it uses a well-proven architecture and standard existing capabilities.
Also, Flask will not support websockets (i.e., wsgi doesn't support
websockets) and probably won't in the near future. After seeing the
readability improvement when moving to flask, I'm willing to give up
websockets right now to get a more developer-friendly notebook.


I just want to make sure that we don't run out of steam before we get
some steady progress going.

Jason

[1] http://hacks.mozilla.org/2010/12/websockets-disabled-in-firefox-4/

William Stein

Jan 20, 2011, 1:16:15 PM
to sage-n...@googlegroups.com, Gregory Bard
Hi,

I was talking with Greg Bard yesterday, and an idea came up for
another related project that would be much easier, yet still very
helpful to a _lot_ of users. It's to make a very highly scalable
database-oriented secure website that does nothing but "evaluate a
block of Sage code in a clean namespace". The key thing is that we
make it incredibly scalable, e.g., to thousands of users. And somehow
test this scalability.
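
As a rough illustration of "evaluate a block of Sage code in a clean namespace", here is a minimal sketch that uses plain Python's exec as a stand-in for Sage. The function name evaluate_block and the shape of the returned dictionary are assumptions made for this example, not an existing Sage interface. A fresh dictionary per call keeps one request's variables away from the next; it does not sandbox the code, so isolation for security would still have to come from elsewhere.

```python
import contextlib
import io
import traceback


def evaluate_block(code):
    """Run `code` in a fresh namespace and capture what it prints.

    Hypothetical helper: plain Python stands in for Sage here.
    """
    namespace = {}  # new dict per call, so no state leaks between requests
    stdout = io.StringIO()
    error = None
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, namespace)
    except Exception:
        # Report the failure as text instead of crashing the service.
        error = traceback.format_exc()
    return {"stdout": stdout.getvalue(), "error": error}
```

Two consecutive calls do not see each other's variables, which is the "clean namespace" property described above.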

This could be fairly easy to build starting with the demo here:
http://code.google.com/p/simple-python-db-compute/. It will have to
display graphics, etc., so the javascript is still nontrivial. But
doing this is vastly less daunting than what is mentioned below, and I
can't see how we could do our ultimate goal without easily doing what
I'm proposing here.

I imagine that this could be something that specifically uses apache +
mongodb.

Once it works, we could try to modify it to also serve individual
public interacts.

Does anybody want to help with this?

> What do people think?  I guess one big question is how it compares with the
> other main proposal for scalable notebook design:
>
> https://docs.google.com/document/d/1uYJXPAWypGgb92QStJ19cW-29y4-hn5hi8oXMR-11TU/edit?hl=en&authkey=CISp9cQB&pli=1#
>
> Should we decide on one or the other, or should both go forward, or should
> there be some merging of ideas?
> I plan to work on the database-centric
> design, as it seems to be:

The two designs are in a sense orthogonal, due to their levels of detail.

The "database-centric design" has I think far more details thought
through. That said, Alex's design above is also database-centric.
There's not much real difference between them, except we thought
through every detail of messages going back and forth from the
perspective of making sure it's possible to implement a range of
specific mathematical commands that matter to us (graph editing,
showing plots, calculating). If you look at the top of Alex's design
it has the exact same big three components as the other design (but
the internals are all much more complicated with Alex's design)

> 1. easier to manage with smaller more self-contained tasks
> 2. competitive in scalability
> 3. Websockets (which I think is a main component of Alex's design) are
> probably not going to make it into Firefox 4.0 due to security concerns [1],
> so I think that technology may need to mature just a bit more.  I can easily
> understand what is needed for the database-centric design and it uses a
> well-proven architecture and standard existing capabilities. Also, Flask
> will not support websockets (i.e., wsgi doesn't support websockets) and
> probably won't in the near future.  After seeing the readability improvement
> when moving to flask, I'm willing to give up websockets right now to get a
> more developer-friendly notebook.
>
>
> I just want to make sure that we don't run out of steam before we get some
> steady progress going.

I propose we do the warm-up I mentioned above. It will quickly and
definitively bring the ability for anybody to compute with Sage over
the web robustly, though without all the session, etc., stuff. And
since it will be very simple, it will be easier to make it cell-phone
friendly.

Also, what I'm proposing is a lot like Wolfram Alpha or
http://magma.maths.usyd.edu.au/calc/.

-- William

--
William Stein
Professor of Mathematics
University of Washington
http://wstein.org

Rob Beezer

Jan 20, 2011, 2:38:37 PM
to sage-notebook
On Jan 20, 10:16 am, William Stein <wst...@gmail.com> wrote:
> It's to make a very highly scalable
> database-oriented secure website that does nothing but "evaluate a
> block of Sage code in a clean namespace".

As a highly-interested observer to the scalability effort, this gets a
big +1 from me.

Think http://math1.skku.ac.kr/wap_html but on steroids.

As William said, this sounds like a great warm-up project that would
help with a lot of the design decisions for the larger notebook
project.

And long-term, I think this would be a very valuable site for three
purposes (a) people trying out Sage for the first time (marketing),
(b) a significant amount of student use as a super-calculator sans
account, and (c) mobile device use in a calculator-mode.

Rob

Matthew Turk

Jan 20, 2011, 3:02:02 PM
to sage-n...@googlegroups.com

Hi all,

As a highly-interested observer to the web notebook process, this
could be enormously beneficial to computational scientists, as well.
For instance, in Astrophysics the question in the past has been how to
provide an interface to running analysis at remote locations while
avoiding the hassle and overhead of SSH, X11 tunneling, etc etc, and
having a "single cell" mode that dispatched jobs transparently from a
frontend would be extremely useful.

Just for a simple example, imagine the case where you have a group
that runs large suites of Galaxy Cluster simulations. The datasets
produced by these calculations may be many gigabytes or even terabytes
in size, but it may be useful to provide the ability for remote
collaborators to interact with them. (This is one of the things the
NSF Teragrid Science Gateway program was designed to address.)
Typically this is done by either providing login information or copies
of the data, which then get shipped around, people log in via SSH,
tunnel images back and forth, etc etc.

However, with this environment, where you have the separation between:

* UI (including script / block-of-code and image display)
* Job dispatch and initiation
* Job execution

you would have the ability to address each component individually.
The dispatch could dispatch to an MPI-enabled cluster, single
processor, etc. This means rather than shipping around data or
providing manual logins, analysis (exploratory or otherwise) could be
conducted in the browser, without the overhead typically endured.

By providing the "single-block-of-code" approach, it becomes a very
general solution -- applicable as well to environments of
data-intensive analysis and exploration! In particular, this would
meet a need that's not necessarily well-met by stateful, notebook
exploration.

Anyway, I think this is a very exciting direction for the sage-notebook!

Best,

Matt

Jason Grout

Jan 20, 2011, 3:33:40 PM
to sage-n...@googlegroups.com
On 1/20/11 12:16 PM, William Stein wrote:


>
> The two designs are in a sense orthogonal, due to their levels of detail.
>
> The "database-centric design" has I think far more details thought
> through. That said, Alex's design above is also database-centric.
> There's not much real difference between them, except we thought
> through every detail of messages going back and forth from the
> perspective of making sure it's possible to implement a range of
> specific mathematical commands that matter to us (graph editing,
> showing plots, calculating). If you look at the top of Alex's design
> it has the exact same big three components as the other design (but
> the internals are all much more complicated with Alex's design)


I see there being rather fundamental differences between the two
designs. By database-centric, I meant that *everything* goes through
the database and nothing really lives outside of the database (i.e.,
everyone goes through the database for any information, and immediately
puts things back in the database). In Alex's design, the database just
stores worksheets and is updated when a worksheet is saved. The primary
work is done outside of the database with a server-side process that
maintains state and communicates with the workers.
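
To make the "everything goes through the database" idea concrete, here is a minimal sketch under some loud assumptions: sqlite3 from the Python standard library stands in for whatever database is eventually chosen, and the table layout and the names make_db, web_submit, worker_step and web_poll are invented for illustration. The point is only that the web front end and the worker never talk to each other directly; each one reads from and writes to the shared table.

```python
import sqlite3


def make_db():
    """Create an in-memory database holding the shared computation table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE computations ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " code TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " output TEXT)"
    )
    return db


def web_submit(db, code):
    """Web front end: store the request and return its id. No worker is contacted."""
    cur = db.execute("INSERT INTO computations (code) VALUES (?)", (code,))
    db.commit()
    return cur.lastrowid


def worker_step(db, evaluate):
    """Worker: take the oldest pending row, evaluate it, write the result back."""
    row = db.execute(
        "SELECT id, code FROM computations"
        " WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False  # nothing to do
    comp_id, code = row
    output = evaluate(code)
    db.execute(
        "UPDATE computations SET status = 'done', output = ? WHERE id = ?",
        (output, comp_id),
    )
    db.commit()
    return True


def web_poll(db, comp_id):
    """Web front end: read status and output straight from the database."""
    return db.execute(
        "SELECT status, output FROM computations WHERE id = ?", (comp_id,)
    ).fetchone()
```

This sketch assumes a single worker. With several workers, claiming a pending row would need to be made atomic so that two workers do not pick up the same computation.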


> I propose we do the warm-up I mentioned above. It will quickly and
> definitively bring the ability for anybody to compute with Sage over
> the web robustly, though without all the session, etc., stuff. And
> since it will be very simple, it will be easier to make it cell-phone
> friendly.
>
> Also, what I'm proposing is a lot like Wolfram Alpha or
> http://magma.maths.usyd.edu.au/calc/.

That sounds like a great idea. Like you said, it's a good warm-up to
shake out any bugs or unforeseen complications with using the database or
the protocol we thought out, as well as a way to easily test scalability
(it would be trivial to script something that would hit the page with
1000 simultaneous computations). Right now, it's been a little daunting
(no time...) to completely rewrite the notebook for the new
architecture. Your idea is certainly more doable in what time I have.
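
A hedged sketch of what such a load script might look like. The submit argument is a placeholder: in a real test it would send one computation to the page over HTTP and wait for the answer, but no URL or request format has been fixed yet, so it is left as a callable here. The names run_load_test, n_requests and concurrency are assumptions for this example. Note that a thread pool keeps at most `concurrency` requests in flight at once, so 1000 truly simultaneous requests would need a pool of that size.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(submit, n_requests=1000, concurrency=100):
    """Call `submit(i)` n_requests times from a thread pool and time each call.

    `submit` is a hypothetical callable that sends one computation to the
    service and returns once the result is available.
    """
    def timed(i):
        start = time.perf_counter()
        try:
            submit(i)
            ok = True
        except Exception:
            ok = False  # count the failure, keep the test running
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(n_requests)))

    latencies = sorted(t for ok, t in results if ok)
    return {
        "requests": n_requests,
        "failures": sum(1 for ok, _ in results if not ok),
        "median_s": latencies[len(latencies) // 2] if latencies else None,
        "max_s": latencies[-1] if latencies else None,
    }
```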

Having the Sage process start by forking will be critical to making this
webpage fast enough to be really useful, I think, as well as making the
load on the server bearable. What security implications are there in
forking a Sage process several times for different users?
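
For reference, a minimal sketch of the forking idea using only os.fork from the Python standard library (Unix only). The helper name evaluate_in_fork is made up for this example. The parent would pay the start-up cost once; each request then runs in a forked child, and the result comes back over a pipe. The sketch says nothing about the security question: the child runs as the same user as the parent, with the same permissions.

```python
import os
import pickle


def evaluate_in_fork(func, *args):
    """Run func(*args) in a forked child and return its result to the parent.

    The child inherits everything the parent has already imported, so the
    expensive start-up cost is paid once. Unix only (uses os.fork).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        os.close(read_fd)
        try:
            payload = pickle.dumps(("ok", func(*args)))
        except Exception as exc:
            payload = pickle.dumps(("error", repr(exc)))
        with os.fdopen(write_fd, "wb") as out:
            out.write(payload)
        os._exit(0)  # leave immediately, skipping the parent's cleanup handlers
    # parent process
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as inp:
        data = inp.read()
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    if not data:
        raise RuntimeError("child exited without returning a result")
    status, value = pickle.loads(data)
    if status == "error":
        raise RuntimeError(value)
    return value
```

Changes the child makes to in-memory state are not visible in the parent, which is also what gives each computation a fresh starting point.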

Jason

Tom Boothby

Jan 20, 2011, 3:46:02 PM
to sage-n...@googlegroups.com
On Thu, Jan 20, 2011 at 10:16 AM, William Stein <wst...@gmail.com> wrote:
> I was talking with Greg Bard yesterday, and an idea came up for
> another related project that would be much easier, yet still very
> helpful to a _lot_ of users.   It's to make a very highly scalable
> database-oriented secure website that does nothing but "evaluate a
> block of Sage code in a clean namespace".   The key thing is that we
> make it incredibly scalable, e.g., to thousands of users.  And somehow
> test this scalability.

I think this is a GREAT idea. If I recall correctly, didn't we
implement almost exactly this in about 30 minutes during Bug Days? It
would need some polish and the addition of sessions, but it's almost
there, right?

Jason Grout

Jan 20, 2011, 4:00:12 PM
to sage-n...@googlegroups.com

I think it would need (maybe grouped in version numbers):

0.1:
* implement a computation id assigned to each post request
* javascript to receive a computation id after request and continue
polling for that result
* run workers in a virtual machine for security
* implement forking Sage to start up a new worker process to lower
load and latency

0.2:
* the "stream" protocol we discussed for transmitting generated files
or other information
* javascript to handle the different streams (e.g., an image stream, a
jmol stream, etc.)

0.3:
* public interacts
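
The first two 0.1 items (a computation id per request, and a client that keeps polling for that id) can be sketched as follows. Python stands in for the browser-side JavaScript here, and ComputationStore and wait_for_result are hypothetical names; the in-memory dictionary is only a placeholder for the real server side.

```python
import time
import uuid


class ComputationStore:
    """In-memory stand-in for the server side (hypothetical)."""

    def __init__(self):
        self._results = {}

    def submit(self, code):
        comp_id = uuid.uuid4().hex  # id handed back to the client right away
        self._results[comp_id] = None  # None means "still running"
        return comp_id

    def finish(self, comp_id, output):
        self._results[comp_id] = output

    def poll(self, comp_id):
        return self._results[comp_id]


def wait_for_result(poll, comp_id, interval=0.05, timeout=5.0):
    """Poll until a result is available or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = poll(comp_id)
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError("no result for computation %s" % comp_id)
```

The timeout matters in practice: without it, a client whose computation was lost would poll forever.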

Thanks,

Jason

William Stein

Jan 20, 2011, 5:25:23 PM
to sage-n...@googlegroups.com
On Thu, Jan 20, 2011 at 1:00 PM, Jason Grout
<jason...@creativetrax.com> wrote:
> On 1/20/11 2:46 PM, Tom Boothby wrote:
>>
>> On Thu, Jan 20, 2011 at 10:16 AM, William Stein<wst...@gmail.com>  wrote:
>>>
>>> I was talking with Greg Bard yesterday, and an idea came up for
>>> another related project that would be much easier, yet still very
>>> helpful to a _lot_ of users.   It's to make a very highly scalable
>>> database-oriented secure website that does nothing but "evaluate a
>>> block of Sage code in a clean namespace".   The key thing is that we
>>> make it incredibly scalable, e.g., to thousands of users.  And somehow
>>> test this scalability.
>>
>> I think this is a GREAT idea.  If I recall correctly, didn't we
>> implement almost exactly this in about 30 minutes during Bug Days?  It
>> would need some polish and the addition of sessions, but it's almost
>> there, right?
>>
>
> I think it would need (maybe grouped in version numbers):
>
> 0.1:
>  * implement a computation id assigned to each post request

+1

>  * javascript to receive a computation id after request and continue polling
> for that result

+1

>  * make workers working in a virtual machine for security

Could be in 0.2. For now could use the same worksheet user on boxen
that is used by sagenb.org.

>  * implement forking Sage to start up a new worker process to lower load and
> latency

Could be easy using @fork:
http://trac.sagemath.org/sage_trac/ticket/9631

Also need:
* configure a mod_wsgi setup to serve Flask.
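
For orientation, the Python side of a mod_wsgi setup is small: mod_wsgi imports a script file and serves the module-level WSGI callable named `application`. A Flask app object is itself a WSGI callable, so with Flask the script would typically just bind that object to the name `application`. The sketch below uses a plain WSGI function instead so that it has no dependencies; the file name and the response text are placeholders, and the Apache configuration that points at the file is not shown.

```python
# sage_eval.wsgi  (hypothetical file name)
# mod_wsgi imports this file and calls the object named `application`
# for every request. A plain WSGI function stands in for the Flask app.


def application(environ, start_response):
    body = b"Sage evaluation service placeholder\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```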

> 0.2:
>  * the "stream" protocol we discussed for transmitting generated files or
> other information
>  * javascript to handle the different streams (e.g., an image stream, a jmol
> stream, etc.)
>
> 0.3:
>  * public interacts
>
> Thanks,
>
> Jason
>
>

--

Dan Drake

Jan 20, 2011, 8:23:59 PM
to sage-n...@googlegroups.com
On Thu, 20 Jan 2011 at 11:38AM -0800, Rob Beezer wrote:
> On Jan 20, 10:16 am, William Stein <wst...@gmail.com> wrote:
> > It's to make a very highly scalable database-oriented secure website
> > that does nothing but "evaluate a block of Sage code in a clean
> > namespace".
>
[...]

>
> And long-term, I think this would be a very valuable site for three
> purposes (a) people trying out Sage for the first time (marketing),
> (b) a significant amount of student use as a super-calculator sans
> account, and (c) mobile device use in a calculator-mode.

There's also (d) an online version of SageTeX that uses this system
instead of the current "simple server API".

Dan

--
--- Dan Drake
----- http://mathsci.kaist.ac.kr/~drake
-------

kcrisman

Jan 20, 2011, 10:13:49 PM
to sage-notebook
> 0.3:
>   * public interacts
>

Obviously I have no technical contribution here, but have to say that
anything which allows this would be stupendous (think W. Demos without
needing to buy anything). I would *certainly* provide lots of
testing!

- kcrisman

Jason Grout

Jan 20, 2011, 10:51:46 PM
to sage-n...@googlegroups.com


Since I will be working on an online library of interacts this summer, I
will definitely get something like this up by then if it's not already
up. I'm also going to announce this project to our CS students and see
if anyone is interested in forming a small working group here at Drake
for helping with this.

Jason

Rado

Jan 21, 2011, 9:29:09 PM
to sage-notebook
On Jan 21, 12:48 am, Jason Grout <jason-s...@creativetrax.com> wrote:
> Following up on the work done at Sage Days, it seems like we have two
> main projects that should be completed:
>
> * Porting the notebook over to use Flask on top of Twisted.  What is the
> status of this?  It seemed like it was pretty much ready when Sage Days
> finished last week.  Do we just need testers?

I have been working on the TODO list at the bottom of
http://wiki.sagemath.org/Notebook%20scalability . The only thing left
is to run the Selenium tests and write tests using Flask's native test
suite. Check out my code at https://code.google.com/r/rkirov-flask/ .

Have we clearly identified the bottleneck for the current notebook
(with the Selenium tests, presumably)? Is there a chance that the
Flask rewrite itself brings an improvement? What if we run the WSGI
Flask app with Apache (I don't know how to do that, but it's feasible)
or even upgrade Twisted (since we don't really need twisted web 2
anymore)?

For the other projects, I can help with the client-side (javascript).
The one shot eval sounds like a good start to iron out the
specifications and the communication protocol before starting with the
whole notebook. As a final feature, the new flask notebook supports
openid, which I think will be more attractive to new users; all it
takes is a gmail account.

Rado

William Stein

Jan 21, 2011, 9:44:32 PM
to sage-n...@googlegroups.com
On Fri, Jan 21, 2011 at 6:29 PM, Rado <rki...@gmail.com> wrote:
> On Jan 21, 12:48 am, Jason Grout <jason-s...@creativetrax.com> wrote:
>> Following up on the work done at Sage Days, it seems like we have two
>> main projects that should be completed:
>>
>> * Porting the notebook over to use Flask on top of Twisted.  What is the
>> status of this?  It seemed like it was pretty much ready when Sage Days
>> finished last week.  Do we just need testers?
>
> I have been working on the TODO list at the bottom of
> http://wiki.sagemath.org/Notebook%20scalability . The only thing left
> is to run the Selenium tests and write tests using Flask's native test
> suite. Check out my code at https://code.google.com/r/rkirov-flask/ .
>
> Have we clearly identified the bottleneck for the current notebook
> (with the selenium tests presumably)?

Certainly not.

> Is there a chance that the
> Flask rewrite itself brings an improvement?

It's possible, but I doubt it.

> What if we run the WSGI Flask
> app with Apache (I don't know how to do that, but it's feasible) or even
> upgrade Twisted (since we don't really need twisted web 2 anymore)?

I doubt any of that would help at all. I don't think WSGI flask +
apache is even possible.

> For the other projects, I can help with the client-side (javascript).

Awesome!

> The one shot eval sounds like a good start to iron out the
> specifications and the communication protocol before starting with the
> whole notebook. As a final feature, the new flask notebook supports
> openid, which I think will be more attractive to new users; all it
> takes is a gmail account.

That is so awesome!!!

>
> Rado

Jason Grout

Jan 22, 2011, 12:09:30 AM
to sage-n...@googlegroups.com
On 1/21/11 8:29 PM, Rado wrote:
> On Jan 21, 12:48 am, Jason Grout<jason-s...@creativetrax.com> wrote:
>> Following up on the work done at Sage Days, it seems like we have two
>> main projects that should be completed:
>>
>> * Porting the notebook over to use Flask on top of Twisted. What is the
>> status of this? It seemed like it was pretty much ready when Sage Days
>> finished last week. Do we just need testers?
>
> I have been working on the TODO list at the bottom of
> http://wiki.sagemath.org/Notebook%20scalability . The only thing left
> is to run the Selenium tests and write tests using Flask's native test
> suite. Check out my code at https://code.google.com/r/rkirov-flask/ .

Fantastic! What is needed to write tests using flask's native test
suite? Is that a small task or a big task? Are you referring to this:
http://flask.pocoo.org/docs/testing/ ?

Jason

Rado

Jan 22, 2011, 4:51:14 AM
to sage-notebook

> Fantastic!  What is needed to write tests using flask's native test
> suite?  Is that a small task or a big task?  Are you referring to this:
> http://flask.pocoo.org/docs/testing/ ?
>
> Jason

Yeah, although it will be overlapping with some of the current
Selenium tests. But I guess more tests never hurt nobody. The
difference is that Selenium is more of a real browser test, while Flask
tests just send GET and POST requests. So Selenium is in fact checking
the rendering and the JavaScript, which makes it slower but more
thorough. Ideally, Selenium tests should cover only that area, and
Flask tests should check that GET/POST requests return the right pages. Then if
selenium tests break, the javascript should be examined, while if
flask tests break, the flask code should be examined.
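
To illustrate the request-level half of that split, here is a sketch that sends GET and POST requests to an application in-process, with no server and no browser. It deliberately uses only the standard library's wsgiref rather than Flask's own test client, so the toy app, its two routes, and the request helper are all assumptions for this example; Flask's test client works in the same in-process spirit.

```python
import io
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Toy WSGI app standing in for the notebook server (hypothetical routes)."""
    method, path = environ["REQUEST_METHOD"], environ["PATH_INFO"]
    if method == "GET" and path == "/":
        status, body = "200 OK", b"<form method='post' action='/eval'></form>"
    elif method == "POST" and path == "/eval":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        # Upper-casing the input is a placeholder for real evaluation.
        status, body = "200 OK", environ["wsgi.input"].read(length).upper()
    else:
        status, body = "404 Not Found", b"not found"
    start_response(status, [("Content-Type", "text/html; charset=utf-8")])
    return [body]


def request(wsgi_app, method, path, data=b""):
    """Send one request to the app in-process; no server or browser involved."""
    environ = {}
    setup_testing_defaults(environ)  # fill in the keys WSGI requires
    environ.update({
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(data)),
        "wsgi.input": io.BytesIO(data),
    })
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(wsgi_app(environ, start_response))
    return captured["status"], body
```

Tests of this kind check status codes and response bodies only. Rendering and JavaScript behaviour stay with the Selenium tests.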

It is a small task because it doesn't make sense to write too many
tests, since there is an upcoming notebook rewrite. It would be good
to have a file with a few tests, for future development.

That is for functional testing; we also need load/stress testing.
Unfortunately, Selenium is not recommended for load/stress testing
[1], because it starts a real web browser for each
test. The other frameworks mentioned in that FAQ - JMeter and Grinder
- are heavy Java stuff :/ However, I found Pylot [2] and FunkLoad [3],
which are Python load/performance testing packages (able to issue a
large number of concurrent requests). I will try to see if I can
reproduce some of the current performance issues with those. I am
curious how many concurrent users it takes to make a home-run Sage
Notebook server slow down significantly.

[1] http://selenium-grid.seleniumhq.org/faq.html#would_you_recommend_using_selenium_grid_for_performanceload_testing
[2] http://www.pylot.org/
[3] http://funkload.nuxeo.org/

Alex Leone

Jan 22, 2011, 6:45:53 AM
to sage-n...@googlegroups.com
Forking and setuid sounds like the way to go.  Python reads a ton of files off the disk otherwise.


On Thursday, January 20, 2011 12:33:40 PM UTC-8, Jason Grout wrote:

I see there being rather fundamental differences between the two
designs.  By database-centric, I meant that *everything* goes through
the database and nothing really lives outside of the database (i.e.,
everyone goes through the database for any information, and immediately
puts things back in the database).  In Alex's design, the database just
stores worksheets and is updated when a worksheet is saved.  The primary
work is done outside of the database with a server-side process that
maintains state and communicates with the workers.


Yup.  Having everything go through the DB will make things simpler process-wise, but eventually all the state updates for temporary sessions will start thrashing the DB if too many things are happening at once.  If there was some set of processes that each served some simple purpose (ala the unix philosophy of doing one thing well) and talked to each other, then it would make everything much more responsive than having to poll the database for state updates.
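
A small sketch of the contrast being drawn here, under the assumption that threads and queue.Queue in a single Python process can stand in for separate cooperating server processes. The names worker and run_jobs are invented for this example. The consumer blocks on the queue and wakes up the moment a result is pushed, rather than repeatedly asking a database whether anything has changed.

```python
import queue
import threading


def worker(jobs, results):
    """Take jobs until a None sentinel arrives; push each result immediately."""
    while True:
        job = jobs.get()
        if job is None:
            break
        job_id, func, args = job
        results.put((job_id, func(*args)))


def run_jobs(job_list):
    """Start one worker, hand it the jobs, and block on results as they arrive."""
    jobs, results = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    for job in job_list:
        jobs.put(job)
    jobs.put(None)  # tell the worker there is nothing more to do
    # results.get() blocks until the worker pushes something: no polling loop.
    collected = dict(results.get(timeout=5) for _ in job_list)
    thread.join()
    return collected
```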


I agree that vanilla websockets aren't supported yet, but check out http://socket.io/ and http://pypi.python.org/pypi/SocketTornad.IO/0.1.3 . It's roughly the same interface, but supported on everything, including iPads, etc.

 - Alex

Harald Schilly

Jan 22, 2011, 7:09:55 AM
to sage-n...@googlegroups.com
On Sat, Jan 22, 2011 at 12:45, Alex Leone <acl...@gmail.com> wrote:
> but eventually all the state updates for temporary sessions will start
> thrashing the DB if too many things are happening at once.

One of the nice things about MongoDB is sharding. This means you can
define a cluster of machines with a "master", where the data is
distributed among them. A simple example: the lowest bit of the hash of
a designated "shard key" for an object decides whether the data is on
machine a or b. I think, after looking at
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key that a good
shard key is the user-id. Then, the user's worksheets are always on
one machine and concurrent writes to different users are distributed -
at the same time, queries/updates for one user only hit one machine.
(paragraph "query isolation")
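
The simple two-machine example above can be written out as a few lines of Python. This only illustrates the routing idea as described in this message (lowest bit of a hash of the shard key); it is not how MongoDB itself assigns data to shards, and the shard names and the shard_for function are hypothetical.

```python
import hashlib

SHARDS = ["machine-a", "machine-b"]  # hypothetical shard names


def shard_for(user_id):
    """Pick a shard from the lowest bit of a hash of the shard key."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    lowest_bit = digest[-1] & 1
    return SHARDS[lowest_bit]
```

Because the user id is the shard key, every worksheet belonging to one user maps to the same machine, while different users spread across both.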

H

Jason Grout

Jan 22, 2011, 9:51:35 AM
to sage-n...@googlegroups.com
On 1/22/11 5:45 AM, Alex Leone wrote:
> Forking and setuid sounds like the way to go. Python reads a ton of
> files off the disk otherwise.

Again, are there any security issues we'd need to be aware of? For
example, what happens if a process crashes? Is it easier for someone to
elevate their permissions if the process was forked?

My guess is that we don't have to worry, but it's always safe to ask
about security issues.


>
>
> On Thursday, January 20, 2011 12:33:40 PM UTC-8, Jason Grout wrote:
>
> I see there being rather fundamental differences between the two
> designs. By database-centric, I meant that *everything* goes through
> the database and nothing really lives outside of the database (i.e.,
> everyone goes through the database for any information, and immediately
> puts things back in the database). In Alex's design, the database just
> stores worksheets and is updated when a worksheet is saved. The primary
> work is done outside of the database with a server-side process that
> maintains state and communicates with the workers.
>
>
> Yup. Having everything go through the DB will make things simpler
> process-wise, but eventually all the state updates for temporary
> sessions will start thrashing the DB if too many things are happening at
> once. If there was some set of processes that each served some simple
> purpose (ala the unix philosophy of doing one thing well) and talked to
> each other, then it would make everything much more responsive than
> having to poll the database for state updates.


It sounds like this single-cell public interface would serve as a good
testbed for both ideas. If it's sufficiently easy to build such a thing
with both designs, then we can throw a huge amount of activity at it and
more objectively evaluate which we'd like to see grow into the full
notebook.

Jason
