filter-media method

SAI KUMAR S

Jun 10, 2024, 6:09:33 AM
to DSpace Community
Hi All,

I have a question about filter-media. I uploaded around 1000 books to a collection and generated thumbnails for the PDF files by running dspace filter-media -f from the command line.

However, when I upload another 1000 files to the same collection, I need to generate thumbnails only for the newly uploaded files. I tried using the skip mode by creating a skip-list.txt, but I am not getting the desired result.

Could anyone provide an example of how to use the skip-list.txt method correctly to generate thumbnails?

Alternatively, is there any other method, such as using a script (e.g., Python), to generate the thumbnails for only the newly uploaded files?

Please help me resolve this.


Thanks & Regards
Sai Kumar S

DSpace Community

Jun 10, 2024, 5:07:15 PM
to DSpace Community
Hi Sai,

If you run "filter-media" **without** the "-f" flag, then it should automatically skip all Items that already have generated thumbnails.   For example:

./dspace filter-media

When you run it **with** the "-f" flag, that tells the filter-media script to **regenerate all thumbnails**.

For more information see the documentation on this script.

(The "skip list" is only needed if you have files which are consistently throwing errors and you want to *skip them from all future runs* of the "filter-media" script.  But, it shouldn't be necessary in your use case.)

Tim

SAI KUMAR S

Jun 11, 2024, 12:28:14 AM
to DSpace Community
Hi Tim

Thank you for the information.

The issue is that when we run ./dspace filter-media, the files that already have thumbnails are still read and then skipped. The process therefore scans every file from the beginning each time, so it takes longer as the number of files grows.

Is there any other method, such as executing a script, for generating thumbnails more efficiently?

Regards
Sai Kumar S

Daan Lessing

Jun 11, 2024, 8:04:04 AM
to SAI KUMAR S, DSpace Community
Good morning,

Just a follow-up question on this. Let's say for instance you have to restore an entire database and the assetstore, do you lose all thumbnails and will filter-media have to start building thumbnails from scratch?

I have been running filter-media for 3 weeks and it has not yet completed.

Looking forward to your response.

Kind regards,
Daan

SAI KUMAR S

Jun 11, 2024, 9:58:49 AM
to Daan Lessing, DSpace Community
Hi Daan,

Thank you for your reply.

As for restoring an entire database and the assetstore, I believe it depends on whether the thumbnails were generated before the backup was taken. If they were, there should be no need to regenerate them from scratch. (I may be wrong; please correct me if so.)

What I wanted to know is why, when I start thumbnail generation, it starts from scratch, even though the files that already have thumbnails are skipped anyway.

I wondered whether there is another method in which files that already have thumbnails are not read at all, and thumbnails are generated only for the files that lack them.

Regards 
Sai Kumar S 

DSpace Community

Jun 12, 2024, 11:24:20 AM
to DSpace Community
Hi Sai & Daan,

The filter-media script always loops through all objects to *determine* which ones need to be processed.  This script is in charge of *not only* thumbnails, but also for extracting text for indexing purposes (and any other actions that are enabled as "filter.plugins" in your dspace.cfg).  See the full docs at https://wiki.lyrasis.org/display/DSDOC7x/Mediafilters+for+Transforming+DSpace+Content

So, this script doesn't keep a list of objects which have already had generated thumbnails.  The reason is that, even if a file has a generated thumbnail, it's possible the file needs to be processed by other filters (e.g. for full text indexing the textual content may be extracted).  So, every time you run "filter-media" it will loop through every file...but will skip any files that it notices were already processed (e.g. if the file already has a thumbnail or extracted text, it will not re-generate it unless you use the "-f" flag to force regeneration).   

The "skip mode" (-s flag) concept can also be used to tell it to skip entire communities/collections/items...but then it will never process that object again until it is removed from the skip list.  So, this should be used sparingly unless you are sure the object never will need a new thumbnail or full text indexing, etc.
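
Since an example of the skip list was requested earlier in the thread, here is a minimal sketch of one way to feed a file to the -s flag. It assumes that -s accepts a comma-separated list of handles (please verify this against the documentation for your DSpace version). The file name skip-list.txt, the helper skip_arg, and the handles are illustrative only. The script prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: turn a skip-list file (one handle per line) into the value for -s.
# ASSUMPTION: -s takes a comma-separated list of handles; check your DSpace docs.
SKIP_FILE="${SKIP_FILE:-skip-list.txt}"   # hypothetical file name
DSPACE_BIN="${DSPACE_BIN:-./dspace}"      # path to the dspace launcher (assumption)

# Join the non-empty, non-comment lines of the file in $1 with commas.
skip_arg() {
  grep -v -e '^[[:space:]]*$' -e '^#' "$1" | paste -sd, -
}

# Print the resulting command for review; nothing is executed here.
if [ -f "$SKIP_FILE" ]; then
  echo "$DSPACE_BIN filter-media -s $(skip_arg "$SKIP_FILE")"
fi
```

With a skip-list.txt containing, say, 123456789/9 and 123456789/100 on separate lines, the printed command would end in -s 123456789/9,123456789/100. As noted above, anything on that list stays unprocessed until it is removed from the list.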

There are options to process files little by little (using the "-m" or maximum flag) or even process files community-by-community or collection-by-collection (using the "-i" or identifier flag) in order to break down a larger job into smaller chunks.
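
As a rough illustration of the chunking described above, the sketch below runs filter-media once per collection with a cap on each run. The -i and -m flags are the ones mentioned in this thread; the handles, the limit of 500, and the DSPACE_BIN and DRY_RUN variables are placeholders of mine, not values from any real installation. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: run filter-media one collection at a time, in capped batches.
# DSPACE_BIN, DRY_RUN and the handles below are illustrative assumptions.
DSPACE_BIN="${DSPACE_BIN:-./dspace}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

# Build (and optionally run) one filter-media call.
#   $1 = community/collection/item handle, passed to -i
#   $2 = maximum number of items for this run, passed to -m
filter_chunk() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$DSPACE_BIN filter-media -i $1 -m $2"
  else
    "$DSPACE_BIN" filter-media -i "$1" -m "$2"
  fi
}

# Hypothetical collection handles; replace with the collections that received new uploads.
for handle in 123456789/10 123456789/11; do
  filter_chunk "$handle" 500
done
```

Pointing -i at only the collection that received the new uploads limits the scan to that collection, which may already shorten the run considerably. The script still loops through every object inside that collection, as explained above.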

This is simply how this tool works at this time.  I do agree there may be ways to make it more efficient.  But, we haven't had a developer volunteer to do such work or to redesign the current process.  If you or anyone else out in the community are interested in helping to improve this tool, I'm sure the Committers would welcome ideas.  All code in DSpace is built/supported by volunteers and users. We don't have a centralized development team (i.e. I have no developers working for me).

Semi-related to this, there have been past discussions about migrating all media filter scripts/tools into curation tasks (which would allow these processes to be run one-by-one as each new submission is added to DSpace, instead of via the current bulk processing script).  There are some older tickets/PRs related to that, but it has never been finished / found to be fully working.  See https://github.com/DSpace/DSpace/issues/6398 and https://github.com/DSpace/DSpace/pull/1674   (That said, I'd love to see this work completed at some point.)

Tim

SAI KUMAR S

Jun 13, 2024, 8:53:04 AM
to DSpace Community
Hi Tim & Daan,

Thank you for your information; it was very helpful.

Regards
Sai Kumar S

Damian Józefowski

Jun 13, 2024, 9:02:38 AM
to DSpace Community
Hi,
In one of our projects at PCG we added an option to this script that lets you set a date from which thumbnails will be generated. The date is compared against the item's last-modified value.
If you are interested, we can prepare a PR with this solution.

Best regards
Damian Józefowski

SAI KUMAR S

Jun 13, 2024, 9:17:52 AM
to DSpace Community
Hi Damian Józefowski,

Thank you.

Could you please share the details? They could help us solve this issue.

Regards
Sai Kumar S

Daan Lessing

Jun 14, 2024, 2:35:22 AM
to DSpace Community
Good morning Tim,

Thank you very much for the detailed explanation, much appreciated.

As you said, migrating all media filter scripts/tools into curation tasks that would allow these processes to be run one by one as each new submission is added to DSpace would be fantastic.

Have a good day.

Kind regards,
Daan

Damian Józefowski

Jun 17, 2024, 9:32:58 AM
to DSpace Community
Hi,

Here is the PR https://github.com/DSpace/DSpace/pull/9653. I hope it will solve your problems.

Best regards,
Damian Józefowski

On Mon, Jun 17, 2024 at 11:22, SAI KUMAR S <saikumar1...@gmail.com> wrote:
Hi,

Thank you for your email.

Could you please share the information needed for the solution?

Thanks & Regards
Sai Kumar S

SAI KUMAR S

Jun 19, 2024, 8:16:29 AM
to DSpace Community
Dear Damian Józefowski,

Thank you for addressing the problem; your assistance is greatly appreciated.

I would also like to know why thumbnails are not being generated for larger files. What could be the cause, and how can it be resolved?

Thanks & Regards
Sai Kumar S
