[Dspace-tech] HTML containing pictures

Richard Jones

Aug 24, 2015, 12:15:53 PM
to dspac...@lists.sourceforge.net
Hi All,

We have just noticed that uploading an HTML file which references pictures
and also uploading the pictures themselves does not result in a renderable
HTML page when the item is viewed. In retrospect, this is obviously the
case given the way that DSpace stores its files, and after a quick think I
can't see an obvious way around this. Would other people consider this a
problem? Should there be a fix for it or is it the case that we simply
can't take HTML files with images referenced?

There are certain circumstances where a user might export to HTML using Word
which tends to make a folder called myHTMLFile_files, which could
potentially contain images. In fact, on a general scale, uploading an
entire website might be desirable in some circumstances. In either of these
cases, maintaining the linking between all the files involved is going to be
non-trivial using the current methodology. Anyone got any thoughts?

Cheers

Richard
==============================
Richard Jones
Systems Developer
Theses Alive! - www.thesesalive.ac.uk
Edinburgh University Library
r.d....@ed.ac.uk
0131 651 1611



Ed Murphy

Aug 24, 2015, 12:15:55 PM
to Richard Jones, dspac...@lists.sourceforge.net
Hi Richard,

We've struggled with this exact problem. We built a Learning Object
Repository based on DSpace. We call our project DLearn:
https://www.dlearn.arizona.edu.

Our main goal is to have instructors submit items they have used in
their classes. With course management systems like WebCT and Blackboard,
HTML pages are likely submissions. We've also had to wrestle with the
problem of instructors wanting to submit whole web sites.

We have not attempted to design, code or implement a solution to this
issue. We deemed it too large an issue for v1 of DLearn. But judging by
the comments and questions we are getting from our first set of testers,
this issue is not settled yet.

You are right, the current infrastructure of DSpace does not support
such functionality. Here you get into the debate "Should this system be
purely an archive, or should it be a system that serves content?" As
you are probably aware there are supporters of both views. This is a
big topic and one I hope we can discuss as part of the "future of
DSpace" discussion next week at the DSpace Users Group Meeting.

- Ed Murphy

Applications Systems Analyst, Sr.
Integrated Learning Center
The University of Arizona

emur...@email.arizona.edu
> -------------------------------------------------------
> This SF.Net email is sponsored by: IBM Linux Tutorials
> Free Linux tutorial presented by Daniel Robbins, President and CEO of
> GenToo technologies. Learn everything from fundamentals to system
> administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> _______________________________________________
> DSpace-tech mailing list
> DSpac...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech


Hussein Suleman

Aug 24, 2015, 12:15:57 PM
to Richard Jones, dspac...@lists.sourceforge.net
hi

this is an old problem and i haven't seen any simple solutions in
packaged systems.

personally, i have an immediate need to archive student projects, which
are essentially hyperlinked websites according to a set of rules that i
laid down (such as: all internal links must be relative). each project
is then submitted as an archived ".zip" or "tar.gz" file.

on my end, i then use a "renderer" that extracts a single file from an
archive dynamically when requested by the web server, using just a
simple URL but with an embedded application somewhere in it. this way,
the server only stores the single archive files for each project, which
greatly simplifies management. ultimately, i will front-end this with a
proper DL and integrate the renderer (i believe EPrints can do this
easily and i'm sure DSpace can as well) but for now the collection is
small so i use static XML (+XSL) files.
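
A minimal sketch of the "renderer" idea described above, assuming each project is stored as a single ZIP file and the web server passes the member's relative path to the application. The function names and the demo archive are hypothetical, for illustration only; this is not the code running on the site mentioned below.

```python
import io
import mimetypes
import posixpath
import zipfile


def read_member(archive, member_path):
    """Return (content_type, data) for one file inside a ZIP archive.

    `archive` is a path or file-like object; `member_path` is the relative
    path taken from the request URL (hypothetical URL layout).
    """
    # Normalise the path and refuse anything that climbs out of the archive.
    clean = posixpath.normpath(member_path.lstrip("/"))
    if clean in (".", "..") or clean.startswith("../"):
        raise ValueError("invalid member path: %r" % member_path)

    with zipfile.ZipFile(archive) as zf:
        data = zf.read(clean)  # raises KeyError if the member is missing

    content_type = mimetypes.guess_type(clean)[0] or "application/octet-stream"
    return content_type, data


def build_demo_archive():
    """Build a tiny in-memory project archive for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("index.html", '<img src="images/logo.png">')
        zf.writestr("images/logo.png", b"\x89PNG fake bytes")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    ctype, body = read_member(build_demo_archive(), "index.html")
    print(ctype, len(body))
```

Because only the requested member is read, the server keeps one archive per project on disk, and relative links inside the pages resolve to further requests against the same archive.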

if you want to see it in action, go to
http://pubs.cs.uct.ac.za/honsproj/

ttfn,
----hussein

ps. this works perfectly on my ns7.1 on ms-windows, but i know it doesn't
work on all platforms

--
=====================================================================
hussein suleman ~ hus...@cs.uct.ac.za ~ http://www.husseinsspace.com
=====================================================================


JQ Johnson

Aug 24, 2015, 12:16:01 PM
to dspac...@lists.sourceforge.net
Richard Jones writes:

> We have just noticed that uploading an HTML file which references pictures
> and also uploading the pictures themselves does not result in a renderable
> HTML page when the item is viewed. In retrospect, this is obviously the
> case given the way that DSpace stores its files, and after a quick think I
> can't see an obvious way around this. Would other people consider this a
> problem? Should there be a fix for it or is it the case that we simply
> can't take HTML files with images referenced?

Hussein Suleman writes:

>this is an old problem and i haven't seen any simple solutions in
>packaged systems.

A number of somewhat unrelated comments:

0/ For a year, this has been my single biggest complaint with DSpace.

1/ It's not just images. It's also other support files such as external CSS
style sheets, .JS files, etc. Think of HTML as a compound document format
implemented as multiple separate files.

2/ I believe this problem has been pretty well addressed in both WebCT and
Blackboard, so I have to disagree with Hussein. Blackboard, for instance,
parses the HTML file when it is uploaded and prompts for each support file
where there's a URL in an IMG, LINK, etc. tag and where the URL is relative
(more precisely, where it is either a simple filename with no slashes, or
has a dirname/filename form, the latter to support HTML exported from MS
Office products). Both WebCT and Blackboard have easy to use and very
workable user interfaces.

3/ On the archival vs display question, I appreciate the philosophical
issues. As a practical matter, though, the fact that DSpace is web based
leads our customers to expect it to be able to display at least simple web
pages. A fortiori, a web page is not an HTML file. It's a set of multiple
URLs in a relationship, and even an archive needs to preserve the
information about that relationship. Introducing an additional packaging
layer by uploading a ZIP or tar file is one way to preserve that
relationship, but it complicates the archiving issues; the "document type"
for supported/known/unknown decisions now needs to encode both the packaging
format and the format of the packaged files.

4/ Per Richard Rogers' email to dspace-general of 5 Sep, the DSpace 1.2
release is planned to include "9) Better support for web page (HTML
document) item display." As a practical matter, I see the ability to
display IMGs with relative URLs as a make or break feature for DSpace. If
some minimal support for this doesn't make it into the 1.2 release, I'm
pretty sure that a lot of us -- probably including UO -- will jump ship for
Fedora or Eprints.

5/ There's a risk that pursuit of the perfect might make us lose sight of
"good enough". I don't think that in general it's possible to archive a web
page perfectly. These objects are too dynamic. And the HTML spec is
complex; consider URLs in a CSS encoded in parentheses, or frames, or
site-rooted relative links. I think the ability to handle a simple HTML
document with simple CSS and image files is critical, but I don't think we
need to handle much more of the complexity of HTML than that. If we can
handle the typical HTML document as created by MS Word's "save as web page"
and have it viewable after being uploaded that will definitely be good
enough for my needs.
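
The upload-time check described in point 2 can be sketched roughly as follows. This is an illustration using Python's standard html.parser, not Blackboard's or DSpace's actual code; the class and function names are made up for the example. It collects URLs from IMG, LINK and SCRIPT tags and keeps only those that are a simple filename or have a dirname/filename form.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Tag/attribute pairs that commonly point at support files.
SUPPORT_ATTRS = {"img": "src", "link": "href", "script": "src"}


def is_simple_relative(url):
    """True for 'file' or 'dir/file'; False for absolute or odd URLs."""
    parts = urlsplit(url)
    if parts.scheme or parts.netloc:
        return False  # absolute URL, e.g. http://host/x.png or //host/x.png
    path = parts.path
    if not path or path.startswith("/"):
        return False  # empty or site-rooted link
    segments = path.split("/")
    if len(segments) > 2:
        return False  # deeper than dirname/filename
    return all(seg not in ("", ".", "..") for seg in segments)


class SupportFileFinder(HTMLParser):
    """Collects relative support-file references in document order."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        wanted = SUPPORT_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value and is_simple_relative(value):
                if value not in self.found:
                    self.found.append(value)


def find_support_files(html_text):
    finder = SupportFileFinder()
    finder.feed(html_text)
    return finder.found
```

An upload form could call find_support_files() on the submitted page and prompt for each name returned. URLs inside CSS, frames and site-rooted links are deliberately left out, in line with the "good enough" scope of point 5.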

JQ Johnson Office: 115F Knight Library
Academic Education Coordinator mailto:j...@darkwing.uoregon.edu
1299 University of Oregon phone: 1-541-346-1746; -3485 fax
Eugene, OR 97403-1299 http://darkwing.uoregon.edu/~jqj/


Tansley, Robert

Aug 24, 2015, 12:16:02 PM
to JQ Johnson, dspac...@lists.sourceforge.net
FYI, the HTML support has already been checked into SourceForge CVS if you want to give it a try. You'll probably need to do a fresh install to try it out at the moment, as there were a couple of database schema changes, and we haven't put together any update guide yet.

We went for a fairly simple approach which deals with HTML documents with only relative links (be they to images, other HTML files, style sheets etc.) Basically a new servlet serves up HTML as-is, and if that HTML contains relative links, the resulting URLs the browser will GET can be resolved by that servlet to the appropriate bitstream. This deals with most 'HTML documents' generated by things like MS Office 'save as Web page', latex2html and the like. We decided not to start down the sticky path of parsing/altering the HTML on the way out.
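
To illustrate the idea only (this is not the servlet code checked into CVS), the lookup could be sketched as below. It assumes each bitstream is stored under the relative name it was uploaded with; the "/html/" URL prefix, the item id and the in-memory store are all hypothetical.

```python
from urllib.parse import unquote

# Hypothetical store: item id -> {bitstream name: content bytes}.
ITEMS = {
    "1234": {
        "index.html": b'<img src="images/fig1.png">',
        "images/fig1.png": b"PNG...",
    }
}


def resolve(path, items=ITEMS, prefix="/html/"):
    """Map a request path to bitstream bytes, or None if nothing matches."""
    if not path.startswith(prefix):
        return None
    # The first segment after the prefix names the item; the rest is the
    # bitstream name, which may itself contain slashes.
    item_id, _, name = path[len(prefix):].partition("/")
    bitstreams = items.get(item_id)
    if bitstreams is None:
        return None
    return bitstreams.get(unquote(name))


if __name__ == "__main__":
    print(resolve("/html/1234/images/fig1.png") is not None)
```

The HTML itself is never rewritten: the browser resolves "images/fig1.png" against the page URL, and the resulting GET lands on the same lookup.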

I'll be first to admit that the upload UI needs a little work on this (you have to upload files one at a time.) Any volunteers? :)

Robert Tansley / DSpace Technical Lead / Hewlett-Packard Laboratories

JQ Johnson

Aug 24, 2015, 12:16:03 PM
to Tansley, Robert, dspac...@lists.sourceforge.net
Based on Robert Tansley's description, my guess is that DSpace 1.2 will
contain enough HTML support to meet the minimal needs of most of us.
Good job!

And yes, I volunteer to work on the UI, though probably I won't get
anything done in time for the 1.2 release.

JQ Johnson Office: 115F Knight Library
Academic Education Coordinator e-mail: j...@darkwing.uoregon.edu
1299 University of Oregon 1-541-346-1746 (v); -3485 (fax)
Eugene, OR 97403-1299 http://darkwing.uoregon.edu/~jqj


Ed Murphy

Aug 24, 2015, 12:16:03 PM
to JQ Johnson, Tansley, Robert, dspac...@lists.sourceforge.net
I too am excited to hear that support for web pages is going to be in
DSpace 1.2. This will really help our Learning Object Repository
project. Good job DSpace team!

I also volunteer to work on the UI.

Ed Murphy

Applications Systems Analyst, Sr.
Integrated Learning Center
The University of Arizona
Tucson, AZ 85721

r.d....@ed.ac.uk

Aug 24, 2015, 12:16:16 PM
to dspac...@lists.sourceforge.net
Hi All,

I am looking forward to trying out the 1.2 release and seeing the HTML support
for the repository in action. I think that a solution for relative paths is
going to be the primary requirement, since the use of absolute paths is not
recommended for internal site navigation anyway, and referring to external
sites should probably be maintained in the archive version anyway.

The solution that I sort of had in mind was one where a document can be
archived in such a way that relative links still work when resolved. For
example, if my index.html, which is viewable in DSpace at the URL:

http://www.era.lib.ed.ac.uk/retrieve/1234/index.html

contains:

<img src="images/myimage.jpg">

then if the URL of myimage.jpg were to be:

http://www.era.lib.ed.ac.uk/retrieve/1234/images/myimage.jpg

then it would still work.
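
This is just standard relative URL resolution, which can be checked with Python's urllib. The first URL is the one from the example above; the second, without a trailing file name, is a hypothetical retrieve URL included only for contrast.

```python
from urllib.parse import urljoin

# Page URL ends in the file name, so the "directory" is /retrieve/1234/.
page = "http://www.era.lib.ed.ac.uk/retrieve/1234/index.html"
resolved = urljoin(page, "images/myimage.jpg")
print(resolved)
# http://www.era.lib.ed.ac.uk/retrieve/1234/images/myimage.jpg

# Hypothetical page URL with no file name after the id: the browser treats
# "1234" as the file and resolves the image one level too high.
bare = "http://www.era.lib.ed.ac.uk/retrieve/1234"
misresolved = urljoin(bare, "images/myimage.jpg")
print(misresolved)
# http://www.era.lib.ed.ac.uk/retrieve/images/myimage.jpg
```

So relative links only survive if the path after the item identifier mirrors the original file layout.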

Obviously, this requires some work to implement, and some changes to the way
that DSpace represents information internally (I know that we can't just throw
this out overnight, but it may be worth thinking about). I haven't been over
the pros and cons relative to Robert's method, but it might be useful for
other kinds of documents that maintain links between each other (like XML docs
with styling attached and so forth). Off the top of my head I would suppose
that you need a relational document option at upload which allows the user to
create/upload directory structures containing all the images/css/js, and that
this structure could be regarded as one "file" (in an abstract sense) which is
viewable in its various parts through the same retrieve URL. So documents
might be retrieved like this:

http://www.era.lib.ed.ac.uk/retrieve/r111/index.html
http://www.era.lib.ed.ac.uk/retrieve/r111/styles/mystyle.css
http://www.era.lib.ed.ac.uk/retrieve/r111/javascript/myjs.js
http://www.era.lib.ed.ac.uk/retrieve/r111/images/image1.jpg
http://www.era.lib.ed.ac.uk/retrieve/r111/images/image2.jpg

(I've added the "r" to indicate that this retrieve number refers to a
relational document)

This, of course, makes it possible to run websites through DSpace, and the
security issues of having the possibility of code executed on the server using
this method should probably be addressed.

Cheers

Richard
--
Richard Jones
-------------

Samuel the Librarian

Dec 6, 2019, 3:42:41 PM
to DSpace Technical Support
Hello, All,

I am replying to this post as I have the same questions. We are wanting to archive a website consisting of about 10 web pages linked to hundreds of images and a few .js and .css files. We're currently on DSpace 5.5. Since it's been four years, is it now possible to upload these files so the web pages retain their images and relative links to the other pages? We were able to upload the HTML itself without issue, but without the associated files it is not able to be a complete archive.

I am wondering if there is a way to upload a compressed set of files so they retain their relationship (.zip or .tar), or if there is a better way others out there have used. If none of that works, we will probably have to fall back to PDF files, but it would lose some of its functionality. Any suggestions would be much appreciated!

--

Samuel Willis, MLS

Technology Development Librarian

Wichita State University Libraries

1845 Fairmount St.

Wichita, KS 67260-0068

(316) 978-5104




Samuel the Librarian

Jan 7, 2020, 2:28:44 PM
to DSpace Technical Support
Hi, All,

Can anyone provide an update on this? Has anyone developed a workflow for ingesting HTML content into DSpace?

Thank you,

--

Samuel Willis, MLS

Technology Development Librarian

Wichita State University Libraries

1845 Fairmount St.

Wichita, KS 67260-0068

(316) 978-5104


Tim Donohue

Jan 7, 2020, 5:27:12 PM
to Samuel the Librarian, DSpace Technical Support
Hi Samuel,

Sorry for the late response (overlooked this when it first came through).  We have some general tips on storing HTML in DSpace in the Documentation itself at https://wiki.lyrasis.org/display/DSDOC6x/Ingesting+HTML+Archives

Generally speaking, only relative links between CSS/images and HTML work OK.  It's been a while since I've tried this out, but it should work as described in those docs.

Tim

From: dspac...@googlegroups.com <dspac...@googlegroups.com> on behalf of Samuel the Librarian <m00se...@gmail.com>
Sent: Tuesday, January 7, 2020 1:28 PM
To: DSpace Technical Support <dspac...@googlegroups.com>
Subject: [dspace-tech] Re: [Dspace-tech] HTML containing pictures
 
--
All messages to this mailing list should adhere to the DuraSpace Code of Conduct: https://duraspace.org/about/policies/code-of-conduct/
---
You received this message because you are subscribed to the Google Groups "DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dspace-tech...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-tech/46fd9c48-36ad-48e7-b270-944b4f66db4e%40googlegroups.com.

Anne Lawrence

Jan 10, 2020, 11:49:44 AM
to DSpace Technical Support
HTML Content in Items might also be useful.