CSV / BulkLoading Rewrite


Jeremy Shipman

Apr 1, 2015, 9:15:16 PM
to silverst...@googlegroups.com
I’ve recently been doing some significant work to overhaul the BulkLoader system. I’ve been releasing my work into a module I’ve called ‘importexport’ for now. Potentially this work could get approved as a future replacement for SilverStripe’s core bulk loading functionality.


My aim is to make the code more S.O.L.I.D (http://en.wikipedia.org/wiki/SOLID_%28object-oriented_design%29), easier to develop with, and to fix some bugs and introduce more features.

Here is a summary of my changes so far:

New Features

Users can define column mappings via CSVFieldMapper / GridFieldImporter.
CSV files can be previewed via the CSVPreviewer class.
Records can be skipped during import. Skipped records are recorded in the result object.
Introduced BulkLoaderSource as a way of abstracting CSV / other source functionality away from the BulkLoader class.
Introduced ListBulkLoader for confining record CRUD actions to a given DataList (HasManyList).
Decoupled CSVParser from BulkLoader. Column mapping is now performed in BulkLoader on each record as it is loaded.
Replaced CSVParser with goodby/csv library.
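To make the decoupling concrete, here is a minimal, language-agnostic sketch in Python (all class names are hypothetical stand-ins, not the module's actual API): a source abstraction yields raw records, and the loader applies the column mapping to each record as it is loaded rather than at parse time.

```python
import csv
import io

class BulkLoaderSource:
    """Abstract source: yields raw records as dicts (hypothetical sketch)."""
    def records(self):
        raise NotImplementedError

class CsvBulkLoaderSource(BulkLoaderSource):
    """Reads records from CSV text; stands in for a goodby/csv-backed source."""
    def __init__(self, text):
        self.text = text
    def records(self):
        return csv.DictReader(io.StringIO(self.text))

class BulkLoader:
    """Maps source columns to field names on each record as it is loaded."""
    def __init__(self, source, column_map):
        self.source = source
        self.column_map = column_map  # e.g. {"First Name": "FirstName"}
    def load(self):
        out = []
        for raw in self.source.records():
            out.append({self.column_map.get(k, k): v for k, v in raw.items()})
        return out

source = CsvBulkLoaderSource("First Name,Email\nAlice,alice@example.com\n")
loader = BulkLoader(source, {"First Name": "FirstName"})
print(loader.load())  # [{'FirstName': 'Alice', 'Email': 'alice@example.com'}]
```

Because the loader only sees a stream of records, a different source (XML, a database table, an API) can be swapped in without touching the mapping or CRUD logic.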

Bug Fixes

Validation failing on DataObject->write() now causes the record to be skipped rather than halting the whole process.
Prevented bulk loader from trying to work with relation names that don't exist. This would particularly cause issues when CSV header names contained a ".".
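The skip-on-validation-failure behaviour can be sketched like this (a hypothetical Python illustration, not the module's actual code): a failing write is caught, noted in the result, and the loop moves on.

```python
class ValidationError(Exception):
    pass

class Record:
    """Stand-in for a DataObject whose write() validates before saving."""
    def __init__(self, data):
        self.data = data
    def write(self):
        if not self.data.get("Email"):
            raise ValidationError("Email is required")

class Result:
    """Collects outcomes so skipped records are reported, not silently lost."""
    def __init__(self):
        self.created, self.skipped = [], []

def load_all(rows):
    result = Result()
    for row in rows:
        try:
            Record(row).write()
            result.created.append(row)
        except ValidationError as err:
            # A failing record is skipped and recorded; the import keeps going.
            result.skipped.append((row, str(err)))
    return result

res = load_all([{"Email": "a@example.com"}, {"Email": ""}])
print(len(res.created), len(res.skipped))  # 1 1
```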


From here...

One particular feature I need is the ability to perform duplicate checking on a relation ID. Implementing this will require the most reworking of the code so far. Here is my proposal for doing this work: https://github.com/burnbright/silverstripe-importexport/issues/11
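The gist of duplicate checking on a relation ID can be sketched as follows (a hypothetical Python illustration of the idea in the proposal, with invented names such as `find_or_make` and `TeamID`): incoming rows are matched against existing records on the relation key, and matches are updated instead of duplicated.

```python
def find_or_make(existing, incoming, duplicate_key):
    """Partition incoming rows into created vs updated, keyed on a relation field.

    existing: list of dicts already in the target list
    incoming: parsed CSV rows
    duplicate_key: e.g. a relation ID column such as "TeamID"
    """
    by_key = {rec[duplicate_key]: rec for rec in existing}
    created, updated = [], []
    for row in incoming:
        match = by_key.get(row[duplicate_key])
        if match is None:
            created.append(row)
        else:
            match.update(row)  # merge the new values onto the existing record
            updated.append(match)
    return created, updated

existing = [{"TeamID": 1, "Name": "Old"}]
created, updated = find_or_make(
    existing,
    [{"TeamID": 1, "Name": "New"}, {"TeamID": 2, "Name": "Fresh"}],
    "TeamID",
)
print(len(created), len(updated))  # 1 1
```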



It would be great if people could review the code I’ve written so far: try it out, report issues, and comment on my proposal in issue #11. I’d also love it if anyone could contribute code to finish off the 0.1.x milestone. https://github.com/burnbright/silverstripe-importexport/milestones/0.1.x

Jeremy Shipman

Apr 8, 2015, 1:03:29 AM
to silverst...@googlegroups.com
I’ve just merged in a whole lot more work on this, which allows customising the way that relation objects are created/joined. You can now also perform duplicate checks on relations. I’ve updated the README with lots of explanation and examples.

I’ll be wrapping up my efforts on this module very soon. There are a few remaining things that could be done, which I’ve added to GitHub issues.
Please do try this out when you can and give any feedback.

Ingo Schommer

Apr 14, 2015, 6:13:52 AM
to silverst...@googlegroups.com
Hey Jeremy, great work! Glad somebody untangled my original BulkLoader API :) I'm a bit hesitant about adding this to core because it adds another dependency (goodby/csv). Was your main motivation for using it the improved memory management?

Maybe you could write a paragraph in your README about converting existing BulkLoader usage, to determine how much API breakage we'd be talking about?

I think the "next frontier" for import/export in core is figuring out how to deal with data amounts that either exceed the PHP memory size or the max execution limit. Did you have some ideas regarding this? Most likely through a queue, although the same concerns about adding a core dependency apply here. Might just be a matter of allowing an optional $startRow parameter in the BulkLoader API, the bulk of the work will be on the GridField UI to visualise the required background processing and filling the queue.
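The optional $startRow idea can be sketched in a language-agnostic way (a hypothetical Python illustration, not an existing API): each invocation, e.g. one queued job, processes a bounded slice and reports where the next run should resume, so no single request exceeds the memory or execution limits.

```python
def import_chunk(rows, start_row, chunk_size):
    """Process one bounded slice of the import.

    Returns (processed_rows, next_start), where next_start is None once
    the whole data set has been consumed, so the queue can stop re-enqueuing.
    """
    chunk = rows[start_row:start_row + chunk_size]
    processed = [row.upper() for row in chunk]  # stand-in for real record writes
    next_start = start_row + len(chunk)
    if next_start >= len(rows):
        next_start = None
    return processed, next_start

rows = ["a", "b", "c", "d", "e"]
start, seen = 0, []
while start is not None:
    processed, start = import_chunk(rows, start, 2)
    seen.extend(processed)
print(seen)  # ['A', 'B', 'C', 'D', 'E']
```

The GridField UI would then only need to persist the cursor between runs and visualise progress; the loader itself stays stateless.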




--
Ingo Schommer | Solutions Architect
SilverStripe (http://silverstripe.com)
Mobile: 0221601782
Skype: chillu23

Mark Guinn

Apr 14, 2015, 3:28:35 PM
to silverst...@googlegroups.com
I agree. Great work on this, Jeremy. I’ve done some work on these kinds of imports with a queue, which I’d be happy to send your way if you think it would help. Ingo's right, though, that the biggest hangup there is going to be standardizing on one of the several options out there. I was using this one: https://github.com/studiobonito/silverstripe-queue but I know that’s probably not the most common one. One option might be to build something like Rails 4.2’s Active Job (http://edgeguides.rubyonrails.org/active_job_basics.html), which acts as an interface to a range of queue backends.
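The Active Job idea boils down to a thin adapter layer. Here is a minimal sketch in Python (all names hypothetical; a real adapter would wrap queuedjobs, Gearman, etc.): application code talks to a dispatcher, never to a concrete backend.

```python
class QueueBackend:
    """Minimal adapter interface, in the spirit of Active Job (hypothetical)."""
    def enqueue(self, job_name, payload):
        raise NotImplementedError

class InMemoryBackend(QueueBackend):
    """Trivial backend for tests; real adapters wrap an actual queue."""
    def __init__(self):
        self.jobs = []
    def enqueue(self, job_name, payload):
        self.jobs.append((job_name, payload))

class JobDispatcher:
    """The only queue API application code sees; backends are swappable."""
    def __init__(self, backend):
        self.backend = backend
    def later(self, job_name, **payload):
        self.backend.enqueue(job_name, payload)

backend = InMemoryBackend()
JobDispatcher(backend).later("CsvImportJob", file="members.csv", start_row=0)
print(backend.jobs)  # [('CsvImportJob', {'file': 'members.csv', 'start_row': 0})]
```

Core (or the importexport module) would then depend only on the interface, and each site picks whichever backend module it already runs.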

Ingo Schommer

Apr 14, 2015, 4:31:52 PM
to silverst...@googlegroups.com
The https://github.com/silverstripe-australia/silverstripe-queuedjobs module was designed as this abstraction layer; it has cron as well as Gearman integration. I can't think of too many actual use cases in core for queues other than CSV import/export, so there isn't really a strong case for a tighter core integration. It might actually be beneficial to have this functionality in a module, since we can be more flexible about adding dependencies on other modules like queuedjobs.

Can anybody think of other realistic use cases of a queue for the core feature set?
I've got Google Sitemap generation, and batch actions on large object collections (e.g. batch publish).

Christopher Pitt

Apr 14, 2015, 4:39:34 PM
to silverst...@googlegroups.com
Sending email, warming cache, importing AD users, AV scanning, moving assets to cloud stores...

Ingo Schommer

Apr 14, 2015, 4:49:49 PM
to silverst...@googlegroups.com
That's why I said "realistic" and "core" ;) I have no doubt that queueing is useful for modules.

- Sending email - We don't send large amounts of email from core functionality
- Warming cache, Importing AD users, AV scanning - Not a core feature as such
- Moving assets to cloud stores - you mean after upload once we have this feature in place for core? Yep, could be a valid use case

Martimiz

Apr 15, 2015, 7:53:00 AM
to silverst...@googlegroups.com
>> ... exceed the PHP memory size or the max execution limit...

Sometimes, instead of using a queue, I take another approach: letting the user perform an operation in batches. For now this is mostly limited to tasks, where the user is presented with a link to run the next batch once the previous one has succeeded. It works well for users who want to feel 'in control' and see results right away. It may seem like nonsense from a developer's point of view, but hey... Does the queue module support something like this? Or could it be implemented in the bulk-import button structure?
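That user-driven approach can be sketched like this (a hypothetical Python illustration; `run_batch` and the state dict are invented for the example): the offset is persisted between requests, and the return value tells the UI whether to render another "continue" link.

```python
def run_batch(state, work, batch_size):
    """One click of a 'run next batch' link.

    state holds the persisted offset between requests (e.g. in the session
    or a task record); finished=True means no further link is needed.
    """
    offset = state.get("offset", 0)
    batch = work[offset:offset + batch_size]
    done = [item * 2 for item in batch]  # stand-in for the real task body
    state["offset"] = offset + len(batch)
    finished = state["offset"] >= len(work)
    return done, finished

state = {}
work = [1, 2, 3]
out1, finished1 = run_batch(state, work, 2)
out2, finished2 = run_batch(state, work, 2)
print(out1, finished1, out2, finished2)  # [2, 4] False [6] True
```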

Martine