Reindexing in FC5 after changing the persistence store


Ralf Neugebauer

Oct 6, 2020, 5:11:47 AM
to Fedora Community
Hello,

I worked with FC 3.8.1 for a long time, and it had a tool, fedora_rebuild.sh, to synchronize the database with the stored files.
I used it very often when a customer got a new server and I had to rebuild FC.

How can I do this in FC version 5?
I see a lot of Python scripts and have played with them, but I can't find a way to rebuild FC.

My first installation used a persistence store on the filesystem, then I inserted some data into it, and I finally switched to a PostgreSQL database.
And now I want to rebuild it.

Thanks for your suggestions.

Kind regards
 Ralf 

Andrew Woods

Oct 6, 2020, 10:44:40 AM
to Fedora Community
* Copied from Slack (http://slack.fcrepo.org/)

Hello @Ralf Neugebauer,
Fedora 4 and 5 do not have the persistence/caching model found in Fedora 3. Therefore, in Fedora 4 and 5 there is no "rebuild" functionality: the persisted "objects/files" are found in pairtree-like directories on the filesystem and the "metadata" is found in your database.
However, the almost-alpha Fedora 6 brings back the sensibilities that you are used to in Fedora 3. Specifically, all of the objects and metadata are persisted to the filesystem in a transparent, specified, self-describing manner, and the database cache can be rebuilt on start-up from what is on disk.

Regards,
Andrew

--
You received this message because you are subscribed to the Google Groups "Fedora Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fedora-communi...@googlegroups.com<mailto:fedora-communi...@googlegroups.com>.
To view this discussion on the web visit https://groups.google.com/d/msgid/fedora-community/91ac6419-84a6-4daa-a7be-e0cd06741f30n%40googlegroups.com<https://groups.google.com/d/msgid/fedora-community/91ac6419-84a6-4daa-a7be-e0cd06741f30n%40googlegroups.com?utm_medium=email&utm_source=footer>.

dc...@prosentient.com.au

Oct 6, 2020, 5:59:26 PM
to Andrew Woods, fedora-c...@googlegroups.com
Hi Andrew,

Can you just clarify that last point? The database *can* be rebuilt on start-up, but it's not *always* rebuilt, right? That would be an expensive operation (financially and computationally) if using third-party object storage with many objects.

David Cook
Software Engineer
Prosentient Systems
72/330 Wattle St
Ultimo, NSW 2007
Australia

Office: 02 9212 0899
Online: 02 8005 0595


Andrew Woods

Oct 6, 2020, 6:02:17 PM
to dc...@prosentient.com.au, fedora-c...@googlegroups.com
Hello David,

Yes... the operative term is indeed *can*.

When Fedora 6 starts up, it checks whether the database is empty; if so, it rebuilds. Although we have not done so yet, it may make sense to let the user invoke a rebuild explicitly... noting that truncating the database has the same effect.
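
The start-up behaviour described here can be sketched roughly as follows. This is only an illustration of the "rebuild when the cache is empty" pattern, not Fedora's actual code: the `resource_cache` table, the `meta.json` files and the directory layout are all made-up stand-ins, and SQLite takes the place of the real database.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def rebuild_if_empty(conn, storage_root):
    """Repopulate the cache table from disk, but only when it is empty.

    Returns the number of rows inserted (0 if the cache already had rows).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resource_cache (id TEXT PRIMARY KEY, title TEXT)"
    )
    (count,) = conn.execute("SELECT COUNT(*) FROM resource_cache").fetchone()
    if count > 0:
        return 0  # cache already populated: skip the expensive scan

    inserted = 0
    # Hypothetical layout: one directory per object, each holding a meta.json.
    for meta_file in sorted(Path(storage_root).glob("*/meta.json")):
        meta = json.loads(meta_file.read_text())
        conn.execute(
            "INSERT INTO resource_cache (id, title) VALUES (?, ?)",
            (meta["id"], meta["title"]),
        )
        inserted += 1
    conn.commit()
    return inserted


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i in range(3):
            obj_dir = root / f"obj{i}"
            obj_dir.mkdir()
            (obj_dir / "meta.json").write_text(
                json.dumps({"id": f"obj{i}", "title": f"Object {i}"})
            )
        conn = sqlite3.connect(":memory:")
        first = rebuild_if_empty(conn, root)    # empty cache: full rebuild
        second = rebuild_if_empty(conn, root)   # populated cache: no-op
        conn.execute("DELETE FROM resource_cache")  # stand-in for TRUNCATE
        third = rebuild_if_empty(conn, root)    # emptied cache: rebuilds again
        conn.close()
        return first, second, third


if __name__ == "__main__":
    print(demo())  # prints (3, 0, 3)
```

The second call is the case David asked about: with a populated cache nothing is scanned, so the cost of walking the storage is only paid when the cache has been emptied.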

Regards,
Andrew


dc...@prosentient.com.au

Oct 6, 2020, 6:12:54 PM
to Andrew Woods, fedora-c...@googlegroups.com
Thanks for elaborating on the rebuild process. That's helpful.

I'm not intimately familiar with F6, but is there anything else stored in
the database other than object metadata? I wonder if there are scenarios
where you might manually update the object storage and want to rebuild the
object metadata without nuking the whole database. Or is a total rebuild the
only option and smaller updates have to be done only via an API?

Andrew Woods

Oct 7, 2020, 10:26:24 AM
to dc...@prosentient.com.au, fedora-c...@googlegroups.com
Hello David,

The F6 database is strictly a cache of metadata that is stored in the underlying OCFL persistence. There is nothing in the database that cannot be rebuilt from what is on disk in OCFL.

Thanks for raising the question of more targeted rebuild scenarios. Given that Fedora's expectations for what a Fedora resource looks like in OCFL are documented and specified [1], there is interest in being able to add new objects directly into Fedora's underlying OCFL, circumventing the Fedora layer. We have been calling this technique "side-loading".

Although it has not yet been implemented [2], we plan on exposing an HTTP API for indicating an OCFL object that has been side-loaded, and initiating the rebuild/ingest of that object into Fedora.

This functionality will support Fedora recognizing new, side-loaded objects. I do not believe Fedora will support users updating existing OCFL objects without a full rebuild.

It would be great to hear more about your use-cases and your experiences with early testing of F6.

Regards,
Andrew
[1] https://wiki.lyrasis.org/display/FF/Design+-+Fedora+OCFL+Object+Structure
[2] https://jira.lyrasis.org/browse/FCREPO-3332


dc...@prosentient.com.au

Oct 7, 2020, 6:32:39 PM
to fedora-c...@googlegroups.com, Andrew Woods
Thanks, Andrew. That "side-loading" technique sounds interesting.

I'd have to put more thought into it, but we do have a variety of ingest use cases. Sometimes, we want to just expose an API endpoint that third parties can use for uploading objects, but sometimes we do want to do behind-the-scenes batch imports of large numbers (thousands) of large binary objects (1GB-1TB in size). It would be great if we could just upload those binaries (and OCFL metadata) into Object Storage and then tell Fedora to add them to its collection. In theory, that would save a lot of time and energy for ingests, and also allow us to take advantage of the latest tools for working with storage backends (I curse Modeshape's outdated use of AWS S3 client tools.)

I do have some commentary about the HTTP API for indicating an OCFL object has been side-loaded, but I'm happy to add that to Jira. Basically, it's just that a batch option should be available, so that you could do 1 HTTP POST with many IDs. One of my pain points when working with Fedora 4 has been that it's often very slow to work with just because of the high volume of HTTP calls I need to make. In some cases, I've collocated apps on the same server as Fedora to reduce the network overhead, and while that has provided huge performance gains... it's not very scalable nor is it that "cloud native" friendly. I've found it's made deployments much more burdensome, since I have to update the whole Fedora server when I want to just change 1 little application that works with it. But I'll add that to the Jira. Thanks for linking it!
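
As a rough illustration of the batching idea above: the sketch below only constructs request objects and never sends them. The endpoint URL and the JSON payload shapes (`{"id": ...}` and `{"ids": [...]}`) are entirely hypothetical, since no such side-loading API had been implemented at the time of this thread, and the object IDs are made up.

```python
import json
import urllib.request

# HYPOTHETICAL endpoint: nothing like this existed in Fedora when this was written.
SIDELOAD_URL = "http://localhost:8080/hypothetical-sideload"


def per_object_requests(object_ids):
    """One POST per object: N side-loaded objects cost N HTTP round trips."""
    return [
        urllib.request.Request(
            SIDELOAD_URL,
            data=json.dumps({"id": oid}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for oid in object_ids
    ]


def batch_request(object_ids):
    """A single POST carrying every ID: one round trip regardless of N."""
    return urllib.request.Request(
        SIDELOAD_URL,
        data=json.dumps({"ids": list(object_ids)}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up identifiers, for illustration only.
ids = [f"example-object-{i}" for i in range(1000)]
singles = per_object_requests(ids)   # 1000 separate requests
batch = batch_request(ids)           # 1 request with 1000 IDs in the body
```

With a thousand objects, the per-object style means a thousand round trips, which is the network overhead described above; the batch style pays that cost once.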

I suppose that once a Fedora resource is in place, it shouldn't really need anything more than metadata updates. If a person needed to update an object, they could always do a delete via the Fedora API and then side-load again?


Ralf Claussnitzer

Oct 16, 2020, 3:39:29 AM
to fedora-c...@googlegroups.com
Hi David, hi all,

> Sometimes, we want to just expose an API endpoint that third-parties can use for uploading objects,

This totally sounds like a use case for a SWORD API:
http://swordapp.org/swordv3/

> but sometimes we do want to do behind-the-scenes batch imports of large numbers (thousands) of large binary objects (1GB-1TB in size). It would be great if we could just upload those binaries (and OCFL metadata) into Object Storage and then tell Fedora to add them to its collection.
I'm not sure about the "side-loading" use case. The Fedora API
guarantees data consistency. Using another client to change the data
without Fedora noticing sounds like a delicate scenario, especially when
checksums are in place to guarantee that the data is still intact and has
not been tampered with. From Fedora's perspective this is unacceptable and
it should enforce a database rebuild.
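
To make the checksum point concrete, here is a minimal sketch of how files written outside Fedora could be checked against an OCFL inventory. It assumes the object root holds an `inventory.json` with a `digestAlgorithm` and a `manifest` mapping each digest to a list of content paths, as the OCFL specification describes. The toy inventory in `demo` is deliberately incomplete (not a valid OCFL inventory), the function names are made up, and this is not how Fedora itself performs validation.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path, algorithm, chunk_size=1024 * 1024):
    """Hash a file in chunks so very large binaries never sit fully in memory."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_object(object_root):
    """Return content paths that are missing or whose digest differs from the manifest."""
    object_root = Path(object_root)
    inventory = json.loads((object_root / "inventory.json").read_text())
    algorithm = inventory["digestAlgorithm"]
    mismatches = []
    for expected, paths in inventory["manifest"].items():
        for rel_path in paths:
            file_path = object_root / rel_path
            if not file_path.is_file():
                mismatches.append(rel_path)
            elif file_digest(file_path, algorithm) != expected.lower():
                mismatches.append(rel_path)
    return mismatches


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        content = root / "v1" / "content" / "hello.txt"
        content.parent.mkdir(parents=True)
        content.write_bytes(b"hello")
        # Toy inventory: only the two fields this sketch reads.
        inventory = {
            "digestAlgorithm": "sha512",
            "manifest": {
                hashlib.sha512(b"hello").hexdigest(): ["v1/content/hello.txt"]
            },
        }
        (root / "inventory.json").write_text(json.dumps(inventory))
        before = verify_object(root)       # file matches the manifest
        content.write_bytes(b"tampered")   # change the file behind the manifest's back
        after = verify_object(root)        # mismatch is detected
        return before, after


if __name__ == "__main__":
    print(demo())  # prints ([], ['v1/content/hello.txt'])
```

A check like this detects a change made behind Fedora's back, but it does not decide what to do about it; whether a mismatch should trigger a full or a per-object rebuild is the open question in this thread.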

> I do have some commentary about the HTTP API for indicating an OCFL object has been side-loaded, but I'm happy to add that to Jira. Basically, it's just that a batch option should be available, so that you could do 1 HTTP POST with many IDs.
It is certainly a good idea to improve the API so that it is better
suited for automation. I'm not sure whether this is already possible
through command line tooling?

> If a person needed to update an object, I suppose they could always do a delete via the Fedora API and then side-load again?
Or they would just do an update and keep the old version. Together with
an audit record documenting this activity. ;-)

Regards,
Ralf

Rastislav Hudak

Oct 29, 2020, 4:01:39 AM
to fedora-c...@googlegroups.com
Hi Ralf!

On Fri, 16 Oct 2020 at 09:39, Ralf Claussnitzer <ralf.cla...@slub-dresden.de> wrote:
> Hi David, hi all,
>
> > Sometimes, we want to just expose an API endpoint that third-parties can use for uploading objects,
>
> This totally sounds like a use case for a SWORD API:
> http://swordapp.org/swordv3/
>
> > but sometimes we do want to do behind-the-scenes batch imports of large numbers (thousands) of large binary objects (1GB-1TB in size). It would be great if we could just upload those binaries (and OCFL metadata) into Object Storage and then tell Fedora to add them to its collection.
> I'm not sure about the "side-loading" use case.

Well I think side-loading is a great feature, a step towards a less monolithic and more flexible solution which answers some real-world problems. Maybe I can answer some of your concerns:
 
> The Fedora API
> guarantees data consistency.

Yes, though whatever Fedora does to guarantee consistency, it does *after* the data is in place. The main point of side-loading (at least for me) is to get the data in place without pushing it through the web server, which is completely unnecessary and makes bulk uploads a pain (and also no longer a behind-the-scenes process, since everybody can see the server got slower :))

There's the other loosely related point about migrating objects between OCFL repositories, hypothetically.

> Using another client to change the data
> without Fedora noticing sounds like a delicate scenario.

That's where the "HTTP API for indicating an OCFL object that has been side-loaded" which Andrew was talking about comes in, to notify Fedora of new objects it should take under its wings.
 
> Especially when
> checksums are in place to guarantee the data is still intact and has not
> been tampered with. From Fedora's perspective this is unacceptable and it
> should enforce a database rebuild.

Right, but not a rebuild of the whole database, at least I hope that's not what Fedora 6 is going to do after a new object was uploaded... ;)
 