Handling bulk data

David MacQuigg

Apr 2, 2010, 12:30:36 PM
to PyWhip
I've added a high-priority enhancement request to our issues tracker.
This is something that can be done right now, independent of our
choice of web2py or Django, and can be split off as a separate
project, not needing close coordination with our long-range plan.

Here is the description from http://code.google.com/p/pykata/issues/list
'''
We need a way to upload and download data that will handle multiple
exercises in one simple
operation. Ideally, the external data will be in the format shown in
our wiki page [[UploadFormat]],
but for now, we just need to make sure the interface to App Engine
works in both directions.

The bulkloader.py tool does not work, and Google is not motivated to
fix it. There is, however,
some good discussion of how to do this with a Python script. See
Chapter 12, "Bulk Data Operations and Remote Access", in the book
"Programming Google App Engine" by Dan Sanderson.
'''

Abhishek Mishra

Apr 3, 2010, 5:32:02 AM
to pyw...@googlegroups.com
Hi David,

I'll try working on it today. I have some tests on Monday, though. I think it might be possible to finish it soon.

regards,
Abhishek


David MacQuigg

Apr 3, 2010, 6:24:18 AM
to pyw...@googlegroups.com, Abhishek Mishra
Hey Abhishek. This problem can wait. Your classes should come first.
Don't be like the star football player who can't be on the team because
he ignored academics.

-- Dave

Abhishek Mishra

Apr 3, 2010, 7:14:36 AM
to pyw...@googlegroups.com
Alright :) Then I will take it on as soon as I find time.

Abhishek

Abhishek Mishra

Apr 3, 2010, 12:12:24 PM
to PyWhip
Hi David,

So I was just going through the UploadFormat spec:
http://code.google.com/p/pykata/wiki/UploadFormat?ts=1270310194&updated=UploadFormat
I wasn't able to copy and paste the format to try it out properly. I
noticed that in the Google Code wiki we can use {{{ }}} to enclose
verbatim code.
I think the code is looking fine now.

Let me know if I've changed the pre tags to {{{ on the wiki the right
way.

Abhishek

David MacQuigg

Apr 3, 2010, 2:45:40 PM
to pyw...@googlegroups.com
I tried the {{{ - - - }}} tags earlier, and they were messing up our
multi-line quotes. I've now found another solution to the quoting
problem, so I think we now have perfect formatting. I've taken out the
note about these formatting problems.

The solution was to change the first triple single-quote to triple
double-quote. (Is that a six-quote or what :>) Google's
syntax highlighter now recognizes it as a multi-line quote, and colors
it properly. Before, it was continuing the quote
past its end and including subsequent code lines.

I hate six-quotes, because they look like *** on my monitor. I
generally prefer triple single-quotes for readability.

I would report this to Google as a bug, but they have some much more
serious problems at the moment, so I won't bother.

-- Dave

Abhishek Mishra

Apr 4, 2010, 4:23:59 AM
to pyw...@googlegroups.com
Hi David,

They say if you have an itch, scratch it! So I couldn't wait and wrote a small parser for the problem format.

Here is an independent test code - http://paste.pocoo.org/show/197412/
It tries to parse http://ideamonk.in/code/foo.py for testing.

Some notes - 
* As 6 fields are required, only submissions with 6 or more fields are parsed.
* the end of problems is marked with
* I've removed the
#---
# next exercise here
#---
block shown in the example, as it would be parsed as a blank problem during testing.

* Once parsed, the data can be pushed directly to the existing problem submission code, which would save us a lot of effort.
The output from there can be brought back to the bulk uploader, and it can point out what the problems were in the submitted code.
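
A minimal sketch of the parsing step described in these notes. It assumes each field starts with a tag line such as "#-name:" and that exercises are separated by "#---" lines; the real tag names and end marker are defined on the UploadFormat wiki page and may differ. The names parse_submissions, TAG_LINE, SEPARATOR and MIN_FIELDS are illustrative only and are not taken from the pasted test code.

```python
import re

# Assumed conventions (not copied from the UploadFormat wiki page):
# a line such as "#-name: reverse" opens a field, and the lines that
# follow belong to that field until the next tag or separator line.
TAG_LINE = re.compile(r"^#-(\w+):?\s*(.*)$")
SEPARATOR = re.compile(r"^#-{3,}\s*$")   # e.g. "#---" between exercises
MIN_FIELDS = 6                            # from the first note above


def parse_submissions(text):
    """Split an upload into exercises and keep those with enough fields."""
    exercises, current, tag = [], {}, None
    for line in text.splitlines():
        if SEPARATOR.match(line):
            # Close the exercise collected so far and start a new one.
            exercises.append(current)
            current, tag = {}, None
            continue
        match = TAG_LINE.match(line)
        if match:
            tag = match.group(1)
            current[tag] = [match.group(2)]
        elif tag is not None:
            # Continuation line of the most recent field.
            current[tag].append(line)
    exercises.append(current)
    # Drop blank or incomplete exercises, join multi-line fields.
    return [
        {name: "\n".join(lines).strip() for name, lines in ex.items()}
        for ex in exercises
        if len(ex) >= MIN_FIELDS
    ]
```

Under these assumptions, a comment-only block between two "#---" lines produces an empty exercise, which the field-count filter then discards.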

At this stage I would like to know if the submission parser is doing its work properly.
Do you have any more sample files that I can use on it?

Thanks
Abhishek

Abhishek Mishra

Apr 4, 2010, 4:31:43 AM
to pyw...@googlegroups.com
Here is a test output - http://paste.pocoo.org/show/197414/

Abhishek Mishra

Apr 4, 2010, 6:01:55 AM
to pyw...@googlegroups.com
I just took a look at the Bulkuploader chapter in the book on Google Books.
It seems like an efficient approach, whereas I was thinking of providing an upload page on the website itself.

We can work out both tools.

Next weekend I might find some more time to explore it :)

Abhishek Mishra

Apr 4, 2010, 2:38:21 PM
to pyw...@googlegroups.com
Hi everyone,

Just wrote something that might serve as a bulk uploader, though it's not based on the bulkloader provided by App Engine. That one would be useful too, but I couldn't find time to understand it.

For now I've created a sort of bulk uploader in the web interface itself.
It modifies the existing create-question view and adds an upload option on top.

When a user uploads a problem, it gets parsed by the SubmissionParser as discussed today.
One by one the submissions are passed to the same new_edit function as if a form were to be submitted by the user for that problem. Each problem's errors/success report is generated and shown to the user.
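
The flow just described can be sketched roughly as follows. Here submit_one stands in for the existing new_edit handler, and its return value (a list of error strings, empty on success) is an assumption for illustration, as are the name bulk_submit and the shape of the report.

```python
def bulk_submit(submissions, submit_one):
    """Feed each parsed submission to the single-problem handler.

    `submit_one` is a stand-in for the existing form handler. It is
    assumed to return a list of error strings, empty on success.
    """
    report = []
    for index, fields in enumerate(submissions, start=1):
        try:
            errors = submit_one(fields)
        except Exception as exc:  # one bad problem should not stop the batch
            errors = [f"unexpected failure: {exc}"]
        report.append({
            "index": index,
            "name": fields.get("name", "<unnamed>"),
            "ok": not errors,
            "errors": errors,
        })
    return report
```

Reusing the single-problem handler keeps validation in one place, so a bulk upload and a manual form submission are checked by the same rules.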

I think it's better explained by a demo - 

I've used http://ideamonk.in/code/foo.py as a sample upload

Here is a snapshot of my working version (without .svn) - http://ideamonk.in/code/pykata.zip

There are a few issues - 

    * Multiple submissions result in duplicate problems, the same as when you create duplicate problems manually. Here, however, it should intelligently pick the new/updated version, so that when someone corrects their submission after looking at the report, they don't get multiple copies created.

    * Some more issues that I might have missed... Let me know when you guys test it out :)
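
One possible way to handle the duplicate issue is to key each problem on its author and name, and replace the stored copy on re-upload. The sketch below uses a plain dict in place of the datastore; the key choice and the name upsert_problem are assumptions, not something decided in this thread.

```python
def upsert_problem(store, user_id, fields):
    """Create or replace a problem keyed by (author, normalized name).

    `store` is a plain dict standing in for the datastore.
    Returns "created" or "updated" so the upload report can say which.
    """
    key = (user_id, fields["name"].strip().lower())
    action = "updated" if key in store else "created"
    store[key] = dict(fields)  # copy, so later edits to `fields` don't leak in
    return action
```

With this approach, a corrected re-upload overwrites the earlier version instead of adding a second copy.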


regards,
Abhishek

Abhishek Mishra

Apr 4, 2010, 2:38:59 PM
to pyw...@googlegroups.com
Let me know if this patch is committable :)

David MacQuigg

Apr 4, 2010, 8:48:25 PM
to pyw...@googlegroups.com
Looks good. Go ahead and commit it. I'll try it on my local server,
then upload to PyKata.

Nice

-- Dave

David MacQuigg

Apr 5, 2010, 4:28:32 AM
to pyw...@googlegroups.com
Amazing!! It works!! I see just a few issues, and I don't want to rush
you to work on this, so I'll hold off on uploading to the App Engine
until we have more time. Here is what I see so far:

1) We should limit bulk uploads to just a few authorized users.
Otherwise this could be an opportunity for vandalism. One way we could
implement this more securely is to not have the "Upload Problems"
invitation on the Add Problem form, but have a special name and password
to be entered in the Name field of Add Problem - something like
"bulkloader x$yz873-Q". You have to get the password right to bring up
the bulkloader form. Otherwise, you just see an error as if you had
entered an invalid problem name - no clue that there is something
special about this field. We don't want to invite script kiddies to try
all their dictionary attacks.

2) The leading _underscore on the name of the skeleton function was
there just to avoid name conflicts with the solution function. In the
presentation to the user this should be stripped off. Otherwise you
will get an error when you try to run the doctests on the skeleton function.

See http://code.google.com/p/pykata/wiki/UploadFormat for clarification
of this and a bunch of other details about how we should handle missing
tags, etc.

3) The #-userID tag should not create any new IDs. It should always be
the same as the current session ID, except in one very special case: The
PyKata administrator (currently me) is uploading bulkdata from numerous
users (as in a backup/restore scenario). Maybe for now, we should just
leave out this tag. Since we are bulkloading via a normal form, the
user will always be logged in, and we can assume the userID is already set.
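
The three points above might be sketched as small helpers like the ones below. All names (is_bulk_request, public_skeleton, resolve_user_id) and the placeholder passphrase are hypothetical; the sketch assumes the secret is entered after a "bulkloader " prefix in the Name field and that skeleton functions are defined with a single leading underscore.

```python
import hashlib
import hmac
import re

# Placeholder secret for illustration; only its hash is kept in the code.
BULK_SECRET_SHA256 = hashlib.sha256(b"example-passphrase").hexdigest()
BULK_PREFIX = "bulkloader "

# Matches "def _name(" at the start of a line, keeping any indentation.
SKELETON_DEF = re.compile(r"^(\s*def\s+)_(\w+\s*\()", re.MULTILINE)


def is_bulk_request(name_field):
    """Point 1: True only when the Name field carries the right secret."""
    if not name_field.startswith(BULK_PREFIX):
        return False
    candidate = name_field[len(BULK_PREFIX):].encode("utf-8")
    digest = hashlib.sha256(candidate).hexdigest()
    # compare_digest avoids leaking how much of the value matched.
    return hmac.compare_digest(digest, BULK_SECRET_SHA256)


def public_skeleton(source):
    """Point 2: drop one leading underscore from each skeleton def."""
    return SKELETON_DEF.sub(r"\1\2", source)


def resolve_user_id(session_user, tagged_user=None, is_admin=False):
    """Point 3: honour the #-userID tag only for an administrator."""
    if is_admin and tagged_user:
        return tagged_user
    return session_user
```

A False result from is_bulk_request would fall through to the ordinary invalid-name error, so a wrong guess looks the same as any other bad problem name.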

Good luck with your test.

-- Dave
