Exclude Text bundles from Google Scholar indexing


Bill Tantzen

Feb 3, 2026, 12:04:49 PM
to DSpace Technical Support
We are discovering extracted text from the TEXT bundle indexed in Google Scholar.  I'm not sure how this is happening.  Bitstreams in the TEXT bundle are referenced numerous times in the <script> element of the page source, but not in the UI so far as I can tell.

Is there a way to prevent these bitstreams from being indexed?

Thanks for any tips!
~~Bill

--
______________________________________
Bill Tantzen    University of Minnesota Libraries
612-626-9949 (U of M)  612-325-1777 (mobile)

DSpace Technical Support

Feb 3, 2026, 5:44:22 PM
to DSpace Technical Support
Hi Bill,

I have to admit, I find this confusing too.  I'm also not aware of anywhere in the UI where we provide a *publicly available* link to files in the TEXT bundle.  If there is such a way that we are "exposing" the TEXT bundle to crawlers, then it's accidental.  Files in that TEXT bundle are not meant for public downloads.

Are you able to get any clues from Google Scholar regarding which Items are linking to TEXT bundles?  Are they all newer content, or older content?  If older content, it's always possible this was a bug in an older version of DSpace.  If newer content, that implies we may be missing a place where these are exposed in recent DSpace versions. That would in turn imply that they are in the HTML *somewhere*, likely either on the Item page or the "Full" Item page.  (I'm not seeing them on either of those pages on our demo site though, e.g. https://demo.dspace.org/items/bb3eb3d2-9796-4a6b-b08e-af914e2438a9 or https://demo.dspace.org/items/bb3eb3d2-9796-4a6b-b08e-af914e2438a9/full ).  Either that, or Google Scholar's bot is finding links to them elsewhere on the web (which would be odd).

Overall, I think this might require digging for more clues...or (as you've already done) seeing if others have seen this behavior as well.  Either one might help us narrow things down.

Tim

Bill Tantzen

Feb 3, 2026, 5:55:43 PM
to DSpace Technical Support
Same here. Nothing in the UI, but when I view the source, I see 14 instances of each TEXT bitstream there.  Perhaps Google has learned to parse them from there?  I cannot find a trace of them anywhere else.
~~Bill

--
All messages to this mailing list should adhere to the Code of Conduct: https://lyrasis.org/code-of-conduct/
---
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/dspace-tech/e642e6b8-7e83-4460-bb71-0879627bf17dn%40googlegroups.com.

DSpace Technical Support

Feb 3, 2026, 6:04:30 PM
to DSpace Technical Support
Hi Bill,

Where are you seeing these "14 instances of each TEXT bitstream" in the source of the HTML?  I'm not seeing that behavior on the demo site...or maybe I'm misunderstanding how to find it (or overlooking it)? 

Can you see those same references by viewing the source of an Item on the demo site?  For example: https://demo.dspace.org/items/bb3eb3d2-9796-4a6b-b08e-af914e2438a9

It is very possible that Google's crawlers are finding it in the HTML source if it's there.  Keep in mind though that (at least as far as I'm aware) Google Scholar's bots are only accessing the SSR (server side rendered) version of the page, i.e. the page you'd see if you turn off Javascript in your browser.
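
For anyone who wants to check their own site, here is a minimal Python sketch (not part of DSpace) that counts how often a bitstream UUID appears inside <script> elements versus everywhere else in a page's HTML. It only uses the standard library. The names `ScriptTextCollector` and `count_uuid_mentions` are made up for this example, and it assumes you pass in the HTML exactly as served without JavaScript, which should approximate what an SSR-only crawler receives.

```python
from html.parser import HTMLParser


class ScriptTextCollector(HTMLParser):
    """Separate text inside <script> elements from all other text and attributes."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.script_chunks = []
        self.other_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        # Attribute values (e.g. href) are where ordinary page links live.
        for _, value in attrs:
            if value:
                self.other_chunks.append(value)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        target = self.script_chunks if self._in_script else self.other_chunks
        target.append(data)


def count_uuid_mentions(html, uuid):
    """Return how many times `uuid` occurs in <script> text and elsewhere."""
    parser = ScriptTextCollector()
    parser.feed(html)
    parser.close()
    return {
        "in_script": "".join(parser.script_chunks).count(uuid),
        "elsewhere": "\n".join(parser.other_chunks).count(uuid),
    }
```

To try it against a live page, the HTML could be fetched with something like `urllib.request.urlopen(url).read().decode("utf-8")` and passed in together with the UUID of a TEXT bundle bitstream. A non-zero "in_script" count with a zero "elsewhere" count would match what Bill describes.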

Tim

Bill Tantzen

Feb 3, 2026, 6:19:02 PM
to DSpace Technical Support

Andrew K

Feb 3, 2026, 7:37:41 PM
to DSpace Technical Support
I can confirm this issue in 9.1.
Google Scholar sometimes indexes the .pdf.txt files extracted from the original .pdf.
Everything is as described.

On Wednesday, February 4, 2026 at 01:19:02 UTC+2, Bill Tantzen wrote:

Sascha Szott

Feb 4, 2026, 3:02:32 AM
to dspac...@googlegroups.com
Hello everyone,

just a small note regarding the discussion: we already talked about the
topic of bitstreams in the TEXT bundle in a developer meeting last year.

This resulted in the GitHub ticket

https://github.com/DSpace/DSpace/issues/11681.

Presumably, we can restrict access to the bitstreams in the TEXT bundle.
Ideally, the URLs should not appear in the SSR output at all.

Best
Sascha

On 04.02.26 at 01:37, Andrew K wrote:
> I confirm this issue in 9.1
> Sometimes Google Scholar indexes .pdf.txt files extracted from the
> original .pdf
> Everything as described.
>
> On Wednesday, February 4, 2026 at 01:19:02 UTC+2, Bill Tantzen wrote:
>
> For example,
>
> see https://conservancy.umn.edu/items/8cbfee50-2287-49d7-a619-0a6bcdf0b7f8
>
> Search the source for 5cf5fe66-0954-4c8e-839c-cbef34394347, an extracted text bitstream.  It is in the <script> element.
>
> It can be found in Google Scholar at
>
> https://scholar.google.com/scholar?hl=en&as_sdt=0%2C24&q=%22MINNESOTA+GEOLOGICAL+SURVEY+DAVID+L.+SOUTHWICK%2C+DIRECTOR+BULLETIN+48+FREDERICK+WILLIAM+SARDESON%2C+GEOLOGIST%22&btnG=
>

José Carvalho

Feb 4, 2026, 3:24:33 AM
to Sascha Szott, dspac...@googlegroups.com
Dear all,

The TEXT bundle information is also available in other places, such as the OAI-PMH interface (though I'm not sure whether Google Scholar uses this).



Regards,

José Carvalho

Scientific Communication Business Unit Coordinator

(+351) 253 066 735
in...@keep.pt
www.keep.pt
Rua Rosalvo de Almeida 5, Braga, Portugal




Andrew K

Feb 4, 2026, 11:09:34 AM
to DSpace Technical Support
Hello,

It looks like the extracted text is intended for internal search, right?
Then it should never be exposed.

WBR,
Andrew

On Wednesday, February 4, 2026 at 10:02:32 UTC+2, Sascha Szott wrote:

DSpace Technical Support

Feb 4, 2026, 3:58:35 PM
to DSpace Technical Support
Hi All,

Thanks for the additional details, everyone.  This sounds like it's occurring at several institutions, which definitely implies this is a more widespread issue in Google Scholar's indexing of DSpace sites.

Regarding the TEXT bitstream URL appearing in the <script> tag: I'm seeing what you mean, Bill.  Now that I look closely, I'm seeing it also on our demo site.  There does seem to be some extraneous JSON in that <script> tag that looks like cached responses from the REST backend.... I'm not exactly sure where that's coming from, and it *does* seem to sometimes include the URL of the TEXT bundle file.
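
To narrow down which of those cached entries carry the TEXT bitstream URL, a rough Python sketch like the one below could help. It pulls the contents of every <script type="application/json"> element out of the page, parses it, and reports the JSON path of each string value that contains a given search term (for example a bitstream UUID). The names `JsonScriptExtractor`, `find_strings` and `locate_in_embedded_json` are invented for this example. It also assumes the embedded state is plain, unescaped JSON; anything that does not parse is skipped, so an escaped payload would need decoding first.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/json"> element."""

    def __init__(self):
        super().__init__()
        self._capture = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._capture = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "script" and self._capture:
            self.payloads.append("".join(self._buffer))
            self._capture = False

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)


def find_strings(node, needle, path="$"):
    """Yield (path, value) for each string value in a JSON tree containing needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_strings(value, needle, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_strings(value, needle, f"{path}[{index}]")
    elif isinstance(node, str) and needle in node:
        yield path, node


def locate_in_embedded_json(html, needle):
    """Return (path, value) pairs for matches in all embedded JSON payloads."""
    extractor = JsonScriptExtractor()
    extractor.feed(html)
    extractor.close()
    hits = []
    for raw in extractor.payloads:
        try:
            tree = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Not plain JSON (assumption violated); skip it.
        hits.extend(find_strings(tree, needle))
    return hits
```

The reported paths show which part of the embedded state each match sits under, which might give a hint about which request or component is putting it there.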

My guess would be that might be where Google Scholar is finding the link, but I cannot say with any certainty.  They obviously don't share all the information on how they index sites.  But, I do know that Google Scholar uses the SSR (server side rendered) HTML page. Their bots don't use OAI or anything else like that.

I'll bring this up in tomorrow's DSpace Developers Meeting to see if anyone has ideas for a possible fix.  It sounds like either we need to find what is adding that extraneous JSON (and it could be something in Angular), or we may need to re-prioritize a fix for the TEXT bundle permissions issue that Sascha noted, which is logged in https://github.com/DSpace/DSpace/issues/11681.

Tim
