How to import a folder structure from existing site


Jon Ku

May 8, 2024, 8:31:05 PM
to dotCMS User Group
Hello.

I am moving from a site that has about 400 nested folders which are used for navigation.

The existing site is database-driven, so I can easily export the hierarchy of folders as needed.

I've tried a few methods to import this folder structure into my dotCMS instance, for example creating a dummy filesystem on my Mac and using WebDav to copy that into dotCMS.

Because the folder names have spaces, ampersands and colons in their names, the Mac filesystem and WebDav have a hard time handling those. I've tried using different escaping and encoding schemes but the results are poor, and WebDav fails to process them.

Is there a recommended method for doing this? I hope to return the dotCMS identifier for each folder and store that, so pages can be related to their destination folder.

Once I have the folder structure in place we will need a method to create pages within each folder, and then import the appropriate content for each page. Again there is a large number of pages but they are data-driven and can be exported as content types using webhooks (tested already) or perhaps another method through the API.

Any guidance is much appreciated, thank you.

- Jon

Will Ezell

May 9, 2024, 9:25:21 AM
to dot...@googlegroups.com
Jon:

Why create the folders? Are the pages all similarly structured content-wise, or are they more like blog entries with metadata and a body? If this is the case, why not create a content type (weirdFolderPage, or whatever) with a field for the URL slug, and then create a detail page that can serve them? You do not need to use URL mapping or anything; you could just use a vanityURL regex to point all traffic under the folders to a specific detail page, and then on that page use the URL to find and serve the appropriate content. You might need to set an esCustomMapping for the field you want to search by: set it to keyword so it will accept your ampersands and slashes/colons.
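To make the regex idea concrete, here is a minimal Python sketch of the routing logic only, not of dotCMS itself. The `/kb/` prefix, the pattern, and the `content_by_slug` dictionary (standing in for the content search on the slug field) are all hypothetical names chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern: everything under /kb/ goes to one detail page,
# and the remainder of the URL is treated as the slug to look up.
SECTION_PATTERN = re.compile(r"^/kb/(?P<slug>.+?)/?$")


def resolve(url: str, content_by_slug: dict[str, dict]) -> Optional[dict]:
    """Return the content item whose slug matches the URL, or None."""
    match = SECTION_PATTERN.match(url)
    if match is None:
        return None
    # Exact (keyword-style) match: the slug is compared as one whole string,
    # so ampersands, colons and slashes inside it need no special handling.
    return content_by_slug.get(match.group("slug"))


# Made-up content keyed by slug
content = {"Teams/Q&A: Billing": {"title": "Billing"}}
page = resolve("/kb/Teams/Q&A: Billing", content)
```

The point of the exact-match lookup is the same as the keyword mapping: the slug is never tokenized, so odd characters survive intact.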

Or a hierarchical content type that is related to itself that you can use to nest and build navigation?  We do this with our documentation site for example- all navigation is built through relationships.

Or finally, if you are married to the folder idea, and assuming you are using an up-to-date version of dotCMS, I would suggest you use the dotcli and its dotcli files push command to create the folders (they will need at least one file asset in them to push). The dotcli is built for exactly this type of interaction and should be able to push your files and folders right from your file system, hopefully even with their weird folder names. No promises though, as there are some file names that dotCMS does not support out of security concerns.
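Since each folder needs at least one file asset before it can be pushed, a small script can build the local tree from the exported hierarchy. This is a sketch under assumptions: the folder names and the `placeholder.txt` file name are made up, and the directory layout dotcli expects inside its workspace is not shown here, so check the dotcli documentation for where this tree should live.

```python
import tempfile
from pathlib import Path


def build_folder_tree(root: Path, folder_paths: list[str],
                      placeholder: str = "placeholder.txt") -> list[Path]:
    """Create every folder under root and drop one small file into each."""
    created = []
    for folder in folder_paths:
        # Strip slashes so the path is joined under root instead of replacing it.
        target = root / folder.strip("/")
        target.mkdir(parents=True, exist_ok=True)
        marker = target / placeholder
        marker.write_text("placeholder\n", encoding="utf-8")
        created.append(marker)
    return created


# Example with made-up folder names, written to a temporary directory
workspace = Path(tempfile.mkdtemp())
files = build_folder_tree(
    workspace,
    ["/Teams/Privacy Request Center", "/News/Latest News & Updates"],
)
```

Intermediate folders such as `Teams` are created but only receive a placeholder if they appear in the list themselves.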

See this page for the latest releases and here is the documentation page for the dotcli:

Again, the dotcli requires the target dotCMS version to be modern - either an agile release or maybe even our newly minted 24.04 LTS.  If you are on an older version I would suggest updating to a new one before attempting.




Mark Pitely

May 9, 2024, 10:53:37 AM
to dot...@googlegroups.com
I agree with Will - unless you have some strong reason to keep this data in 'folders' (perhaps a body of editors who are resistant to training?) you should move past folder hierarchy. There's no utility in it and it just leaves you trapped the way you are.
I'd use the method Will didn't elaborate: content structure (or two, folder + page) with URLMapping. Probably stronger as just weirdPage structure with what was a folder being a 'parent'. This way you could move and nest existing things easily.
After you export your CSV from the original site (which may or may not be in dotCMS, you didn't specify), you could probably just use Excel tools to fix the naming convention (replace all spaces with underscores, replace all invalid characters with HTML encoding). Five minutes of find-and-replace that has to be done once is probably simpler than trying a programmatic solution.
Keep the original folder text name as an additional field so that you can find any you might have missed!
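If the list is too long for find-and-replace, the same cleanup can be scripted. A minimal sketch, with one swap stated plainly: it uses percent-encoding (`urllib.parse.quote`) rather than HTML encoding for the invalid characters, on the assumption that the cleaned name will end up in a URL. The dictionary keys are hypothetical column names.

```python
import re
from urllib.parse import quote


def clean_folder_name(original: str) -> dict[str, str]:
    """Return the cleaned name alongside the untouched original."""
    # Collapse each run of whitespace into a single underscore.
    name = re.sub(r"\s+", "_", original.strip())
    # Percent-encode everything except letters, digits and "_.-~".
    name = quote(name, safe="")
    # Keep the original text so missed or mangled names can be found later.
    return {"original_name": original, "clean_name": name}


row = clean_folder_name("Q&A: Billing")
print(row["clean_name"])  # Q%26A%3A_Billing
```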
Relationships would be the best way to tie these all together, but they are hard to do without writing code (because you won't be able to keep/control the *new* identifiers). You probably don't want to do them by hand. You can use any text field as a key, though, and url works. You'd have parent, which could be '/toplevel' or '/toplevel/secondlevel', etc. 
If you created just the top level structure, your 'folders' via import, you could then export from the new dotCMS structure with identifiers, which you could then import into your working CSV. That would allow you to create proper relationships. 
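The "url as key" idea can be sketched as follows: each row carries its own path plus the path of its parent, so the hierarchy survives the import without knowing any dotCMS identifiers. The column names `url`, `parent` and `title` are assumptions for illustration, not required field names.

```python
import csv
import io
from pathlib import PurePosixPath


def folder_rows(paths: list[str]) -> list[dict[str, str]]:
    """Turn folder paths into rows keyed by url, each pointing at its parent."""
    rows = []
    for path in sorted(set(paths)):
        p = PurePosixPath(path)
        parent = str(p.parent)
        rows.append({
            "url": str(p),
            # Top-level entries have no parent, so leave the key empty.
            "parent": "" if parent == "/" else parent,
            "title": p.name,
        })
    return rows


def to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize the rows as CSV text ready for import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "parent", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = folder_rows(["/toplevel/secondlevel", "/toplevel"])
csv_text = to_csv(rows)
```

Sorting the paths puts parents before their children, which helps if the import has to create parents first.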

M


Jon Ku

May 9, 2024, 12:03:37 PM
to dotCMS User Group
Thank you both for your prompt replies, very useful ideas!

This is a new dotCMS instance, the current site is a knowledge base with a distributed group of content authors who do that as a small part of their job. It contains both text-based content (articles and how-tos) as well as structured content which is similar to e-commerce, representing products and promotions.

It is a bespoke CMS written in a model/view/controller pattern, with a small group of developers who maintain it. The URLs are simple, with the content type as one or two folders and the URL string with an identifier to specify the content item: 


There is a header navigation and on-page menus, and breadcrumbs with the last entry being the content title:

News > Latest News > Process Update
Teams > Privacy Request Center > Incident Support > Compensation
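As a rough illustration of how such a breadcrumb can be derived from a path-keyed hierarchy, here is a small Python sketch. The URL keys in `titles` are invented; only the display titles come from the example above.

```python
def breadcrumb(path: str, titles: dict[str, str], separator: str = " > ") -> str:
    """Build a breadcrumb by looking up the title of each ancestor path."""
    parts = [segment for segment in path.split("/") if segment]
    trail = []
    for depth in range(1, len(parts) + 1):
        key = "/" + "/".join(parts[:depth])
        # Fall back to the raw path segment when no title is recorded.
        trail.append(titles.get(key, parts[depth - 1]))
    return separator.join(trail)


# Hypothetical url keys mapped to their display titles
titles = {
    "/teams": "Teams",
    "/teams/privacy": "Privacy Request Center",
    "/teams/privacy/incident": "Incident Support",
    "/teams/privacy/incident/compensation": "Compensation",
}
print(breadcrumb("/teams/privacy/incident/compensation", titles))
# Teams > Privacy Request Center > Incident Support > Compensation
```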

After writing my question yesterday I've started viewing the videos and have a better understanding of what is possible in dotCMS. 

My original thought for using a folder hierarchy was to spare the authors and editors any pain in placing and finding their content within the overall site, although currently we have an "author dashboard" which shows them each their own content only, with the possibility of searching for other content items or opening them directly from the front end. We manage the hierarchy using a tool that generates an artificial taxonomy which is used as navigation, and as a breadcrumb. It seems to make sense to follow this structure naturally.

This pattern seems to map onto a category-based solution that would have three levels, and in some isolated cases four. In that case I think the navigation and breadcrumbs would be vtl, but that is not daunting. The structure decision is the big one.

The final suggestion seems to have legs too -- import the tidied-up hierarchy as CSV, then export that to get the mapping of folders to identifiers, which could then be used to export content items and create those relationships. I'd like to explore that option in more detail. It's also a great idea for search etc. to store within each content item the original "folder names" (taxonomy) and also perhaps the original ID, and mirror that with identifiers stored in the original schema to map the content.
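The import-then-export round trip boils down to a join on the path key. A minimal sketch, assuming the export contains `url` and `identifier` columns and the content rows carry a `parent` path; the real column names in a dotCMS export may differ.

```python
import csv
import io


def identifiers_by_url(exported_csv: str) -> dict[str, str]:
    """Map each exported folder url to the identifier dotCMS assigned to it."""
    reader = csv.DictReader(io.StringIO(exported_csv))
    return {row["url"]: row["identifier"] for row in reader}


def attach_parent_identifiers(content_rows, id_map):
    """Split rows into those whose parent resolved and those that did not."""
    resolved, unresolved = [], []
    for row in content_rows:
        identifier = id_map.get(row["parent"])
        if identifier is None:
            unresolved.append(row)  # report these instead of guessing
        else:
            resolved.append({**row, "parent_identifier": identifier})
    return resolved, unresolved


# Made-up export and content rows
exported = "url,identifier\r\n/toplevel,abc-123\r\n"
id_map = identifiers_by_url(exported)
resolved, unresolved = attach_parent_identifiers(
    [{"title": "A", "parent": "/toplevel"}, {"title": "B", "parent": "/missing"}],
    id_map,
)
```

Keeping the unresolved rows separate makes it easy to spot folder names that were mangled during cleanup.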

I need to think about how best to map the existing taxonomy into dotCMS, either as hierarchical folders (dumb), nested categories (maybe) or relationships between content types (or a single recursive one) that represent each level. Next step would be to export that back to the original site and use those to place the content items correctly.

Considerations:

Migration of hierarchy and content (okay if clumsy or tedious since it's a one-time deal)
Flexibility to place new content in correct location, move content or branches to another location
Need to have the same content item in multiple locations
  - I think this requires that the leaf items be pages containing content items
  - There is a dotCMS behaviour to manage that: "Edit all or just this one content item"
Set up a simple UI for content authors showing their own content
Clean UI for editors and admins to manage taxonomy and placement of content
Create navigation and breadcrumb widgets
Translate/rewrite site URLs embedded within content (manageable)
Search -- this is a topic best left for later, we are comfortable with ES
Analytics -- again, a later topic

Under a page/template/container/contentlet structure, what's the best way to a) create pages for each leaf, and then b) attach the appropriate contentlet?

Thinking out loud, I believe we would import the structure, export the identifiers, import the pages, export the page identifiers, then import the contentlets for each page?

I'm still wrapping my head around how to deal with dotCMS via API, webhooks, CSV import and so on, and not completely sure what's possible through the UI or bits of code. I can import content items via webhook and return the identifiers, but haven't come across how to do that for pages, if that's possible. The CLI is another possibility but I haven't gone there, somewhat daunting in the docs!

I'm still working my way through the content architect videos, fell asleep during the workflow one but that seems to be a strong way to adjust the overall backend UI and flow.

Once again thanks for the feedback, I'm off to watch some more videos and do some experiments.

- Jon






John Michael Thomas

May 9, 2024, 12:15:23 PM
to dotCMS User Group
In dotCMS, folders serve several purposes.  They:
1. Define the URL path to the pages & files within them.
2. Allow you to set different permissions for the pages/files in different folders and subfolders.
3. Allow you to set different default Page or File content types for each folder.
    - So, when a new page or file is added to a folder, by default it will be of a specific page or file content type.
4. Provide a sort of taxonomy, where related pages/files are grouped together (and can be searched for) by path.

So, in general, if you need all of these things, and the grouping of your pages in folders matches across all 4 of these (meaning, for example, that pages that share the same URL path should also all use the same page template and should also all be editable or not by the same users/roles), then folders  probably make the most sense.  Otherwise, you might find it easier to treat the URL path of each page as a property of the page, rather than an actual folder - which is basically what Will and Mark are both suggesting.

If you use URL Maps and/or Vanity URLs, you can handle the URL path (#1 above) and taxonomy piece (#4 above) without using folders.  If your URL paths can be easily broken out into a pattern, then we'd generally suggest you use URL maps, because they're much more flexible than folders, and usually make it easier to ensure new files/pages end up in the right place by just setting a field value in the content (instead of having to navigate the folder structure).

If you use different page/file content types for different pages, you can handle both #2 and #3 without needing folders (and possibly in a way that's easier for your content editors to manage than requiring them to navigate down to a specific folder).  So, for example, if you have several different layouts for different kinds of pages (landing pages vs. product pages vs. support pages, for example), then you can create different page content types for each type of page, and control both permissions and things like the default page template used via the content type.

You can also handle #4 using another dotCMS taxonomy (and again, possibly in a way that's easier for your content editors than folders are).  This can be Relationships as Mark suggested, or Tags, Categories, or even a select field in the file/page content type.

It may take a little work (though as Mark said, possibly only in a spreadsheet) to import your existing pages to use these dotCMS features.  But if you do, you'll find it much easier to manage them and to create new pages easily than you will if you just import your existing folder structure directly.

I think one important thing to understand is that if you'd like to gain the power of these dotCMS features at any point, it will probably be much easier for you to import your content to use them now.  If you import using a folder structure now, and want to change it later, you'll have a lot of work to do to make that change.

Hope it helps,
John
Product Manager, dotCMS

Mark Pitely

May 9, 2024, 1:11:13 PM
to dot...@googlegroups.com
dotCMS is really a great power tool, in the sense that there's always at least two ways to accomplish any task (and sometimes 5). That's also part of the problem!
Firstly, everything in dotCMS is structured content. Even the idea of 'pages' and 'folders' are actually abstracted away in the first place. There is no real filesystem, which is why your WebDAV solution isn't ideal.
Everything is in a database. 
Secondly, dotCMS can be 100% headless or 100% CMS and anything in between. You can live in the API, but you can also do everything in a Wordpress-y way. It really depends on how your org is set up and how much developer access and need you have. 
Most of my suggestions have been on the easy end - how to set this up without writing a ton of Velocity or do any API calls - how to get it to a place where content editors can take over. 
I'm currently doing the same thing you are, migrating a huge Wordpress-and-flat-HTML site into a new dotCMS implementation; I've done this several times. 
So, right now, I am really ready to discuss this and share my implementation code.

How many top level folders do you have? You suggested 400 total- but I assumed that many are subfolders.
Do you have a need for folder-level security and access? Are these trees handled by disparate groups?
I am assuming some comfort with scripting (PHP, etc) and JS (API fetches), if that's not the case, let me know. If you'd prefer the JS API approach, we have different tactics.

M


Jon Ku

May 9, 2024, 3:44:41 PM
to dotCMS User Group
Thanks John, I do understand that we need to get the structure right from the beginning ... we are developers who currently build and support a large and complex CMS, which we are directed to migrate away from towards a CMS product, and dotCMS is the clear winner for things like native multi-language support, breadth of features, extensibility and maturity. We look to give our users a very similar if not identical experience, and a back end that is simple and trainable for our large group of content creators who are domain experts in their areas, but not designers or web savvy.

The idea to use URL maps makes sense because that's essentially what we do now, my question is more focused on how to support the hierarchy, taxonomy, within dotCMS for the best result in terms of flexibility and ease of use. There is a concept of multiple sites (about 20) for different business units, all share the same codebase and templates, with some CSS differences and both shared and unique content in each. Each subsite has between 10 and 20 top-level "folders", with 3 subfolders and occasionally a 4th one. There are about 5500 different folders, with up to 500 or so in each subsite. These subsites are now subdomains and that is a possibility in dotCMS but not required.

I think this would be a pretty big CSV since each content item would need to be listed. We do have methods for batching API calls for bulk exports to our search index, and then incremental updates for ongoing maintenance.
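For reference, the batching approach can be as small as the sketch below: it slices any iterable of content items into fixed-size chunks so each API call or CSV file stays a manageable size. The batch size is arbitrary and would need tuning against the target instance.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


chunks = list(batched(range(5), 2))
print(chunks)  # [[0, 1], [2, 3], [4]]
```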

Within that structure are some 60,000 text content items and another 100,000 structured content items. The structured content types are generally navigated by search, with different categories (metadata) to filter and focus the searches by region, timeframe, customer type and so on. We see that filtering process as mapping well to dotCMS categories for Elastic Search and have developed effective front-end UIs for that process.

Hi M (pit?) and thanks for your reply.

To answer your questions:

There's a total of 350 top-level folders, split across some 20 subsites (could be modeled as either dotCMS sites or a meta-top level, but different dotCMS sites seems the best approach). So each subsite has between 10 to 50 top-level folders.

Our current model has restricted access to subsites, folders and content. Most is open to all users but we need to enforce access control at all levels. We may want to create a custom interface to manage those 60 roles across our user base of roughly 6,000 IDs.

As for scripting, we are a small team with expertise in database (currently Oracle), middleware and front end. So SQL, ColdFusion, jQuery and JS are easy -- that allows us to create our own APIs and consume dotCMS APIs to load dotCMS, run headless or continue to feed dotCMS from our existing admin back-end as we transition to the native admin tools. The eventual goal is a fully native dotCMS implementation. Current goal is developing the schema and structure for dotCMS as a mirrored data store -- we're not adamant about any of those steps, it will be a bit of a long road as I'm sure you are aware!

We are also comfortable with the vtl environment now that we've had a taste ... we also do some Java, which underlies ColdFusion (a PHP-like scripting language), similar I imagine to how dotCMS is actually Java and extensible if necessary. It's an interesting mix because we are most comfortable writing code, but eventually the site will be managed by others through the UI.
 
I'd like to ask about pointers for the following specific questions, especially if those are available in the documentation or other places:

1. Relationships vs. categories for taxonomy and navigation:
    Navigation will be scripted I'm sure
    Ease of folder management with existing back end, or write our own UI for that

2. How to programmatically load and manage:
     ** Hierarchy/Taxonomy (folders) and return identifiers (above suggested as separate import/export steps)
     Metadata categories (region, customer type, status etc.) these are not navigation but used for search filters
        - actually there are only about 60 of these, so can be hand-loaded
     Pages -- I believe a small number of pages, with a detail page for each content type, these will be hand-built I guess
     Templates -- again, a small number, hand-built
     ** Content -- bulk export, with each content item (contentlet?) storing its own parent location (folder) identifier and metadata categories

The ** items are critical.

On another note, I am interacting with a local docker instance, using curl. So far I've managed to use the webhooks example to import and read content. I'd like to use the Postman API examples but haven't had much luck running Postman against my instance, or the demo site.
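As an alternative to curl or Postman, the same kind of call can be scripted with Python's standard library. This is a sketch under loud assumptions: the endpoint path (`/api/v1/workflow/actions/default/fire/PUBLISH`), the `{"contentlet": ...}` payload shape, the `entity.identifier` response field, the base URL, the credentials, and the content type and field names are all assumptions to verify against the API documentation for your dotCMS version.

```python
import base64
import json
import urllib.request
from typing import Optional


def build_publish_request(base_url: str, user: str, password: str,
                          contentlet: dict) -> urllib.request.Request:
    """Build (but do not send) a request that saves and publishes one item."""
    # ASSUMPTION: endpoint path and payload shape; check your version's docs.
    url = base_url.rstrip("/") + "/api/v1/workflow/actions/default/fire/PUBLISH"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps({"contentlet": contentlet}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def extract_identifier(response_body: bytes) -> Optional[str]:
    """Pull the new item's identifier out of the JSON response, if present."""
    # ASSUMPTION: the identifier is returned under entity.identifier.
    payload = json.loads(response_body)
    return payload.get("entity", {}).get("identifier")


# Hypothetical local instance, credentials and content type
request = build_publish_request(
    "http://localhost:8080", "admin@example.com", "secret",
    {"contentType": "weirdFolderPage", "title": "Process Update"},
)
```

Sending it would be `urllib.request.urlopen(request)`, with the response body passed to `extract_identifier` so the identifier can be stored against the original record.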

Thanks all for your help,

- Jon

Jon Ku

May 9, 2024, 3:52:43 PM
to dotCMS User Group
A few more specific questions:

What can webhooks do for import -- content only, or categories, relationships, pages?

Same question for CSV import, looks like content can be exported/imported this way ... are there means to import categories or relationships or other types?

Best practices and examples of API for loading and exporting data.

Should I install and start practicing with the CLI, or focus on the API.

Thanks,

- Jon
