Re: Patch 1132 Tested

David Juhasz

Feb 25, 2011, 2:02:23 PM
to Jesús García Crespo, Jessica Bushey, qubi...@googlegroups.com
Hi,

This is a valid issue as far as I'm concerned, so I've forwarded this conversation to the qubit-dev list. 

Jessica, can you please file an issue for the problem to go with our list of other EAD issues:
http://code.google.com/p/qubit-toolkit/issues/list?can=2&q=Component%3DEAD

Please include the import file and your excellent screenshots with the issue report.

Hopefully we can get some time/budget to address all of the outstanding EAD issues in the not-too-distant future.

Cheers,
David

On 11-02-24 12:35 AM, Jesús García Crespo wrote:
Hi Jessica,

Good catch, I think you are right. I did not write the import/export module and I don't know this field either, but in case David thinks this really is an issue, I think I know where the problem is:

David, in QubitXmlImport.class.php we replace \n, \r and \s with a space, at lines 349/350:

// normalize the node text (trim whitespace manually); NB: this will strip any child elements, eg. HTML tags
$nodeValue = trim(preg_replace('/[\n\r\s]+/', ' ', $domNode2->nodeValue));

So this is a problem with import; the export works as expected. I hope this information helps in creating a detailed issue and in knowing where to start when we work on it :-).
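
To illustrate the effect described above, here is a minimal PHP sketch. Only the preg_replace line is taken from the message; the sample text and the $raw variable are invented for the example and do not come from townley.ead.xml.

```php
<?php
// Hypothetical sample: text content of an element holding two paragraphs.
$raw = "  First paragraph.\n\n  Second paragraph,\r\n  continued on a new line.\n";

// Same normalization as the import line quoted above: every run of
// whitespace (including line breaks) becomes a single space.
$nodeValue = trim(preg_replace('/[\n\r\s]+/', ' ', $raw));

// The blank line that separated the paragraphs is gone.
echo $nodeValue, PHP_EOL;
```

With this input the two paragraphs end up on one line, which matches the unformatted display Jessica reported.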

Thank you Jessica!

Regards,

2011/2/24 Jessica Bushey <jes...@artefactual.com>
<snip>

But I noticed that when I imported the townley.ead.xml file that David supplied into the USask test site, the data didn't format properly on the "View archival description" page. Perhaps it is simply that ead.xml file? I also imported townley.ead.xml into Qubit trunk and the ICA-AtoM demo site, and it was unformatted in all of them. The only reason I know it wasn't formatting properly is that I took a screengrab of the townley fonds "View archival description" page in ICA-AtoM before deleting the file and importing the townley.ead.xml file that David supplied. I have attached the screengrabs for you, but this might be a total non-issue, so please ignore it if so.

--
Jesús García Crespo


-- 
David Juhasz,
Software Engineer

Artefactual Systems Inc.
www.artefactual.com

David at Artefactual

Feb 25, 2011, 2:04:25 PM
to Qubit Toolkit Developers
[Fix subject]


Tim Hutchinson

Feb 25, 2011, 2:24:47 PM
to qubi...@googlegroups.com
Hi all,

I may be missing something here but Wu has a customization to deal with
properly importing separate paragraphs. One of the things on my task
list is to get a handle on which of our customizations developed in the
course of figuring out our migration (not yet complete) would make sense
to submit as potential patches, but that would be one of them. At the
moment I believe it's implemented on saindev.usask.ca/ica-atom-fonds but
I'm not sure about the other instances.

So, if there are other formatting issues on our site, possibly a
side-effect of this customization (Jessica mentioned testing on the
USask site), that would be worth knowing about, but if it's a matter of
preserving line breaks as encoded through <p>...</p>, then I think we
have a potential fix.

Tim


--
Tim Hutchinson
University of Saskatchewan Archives
301 Main Library, 3 Campus Drive
Saskatoon, SK S7N 5A4
tel: (306) 966-6028
fax: (306) 966-6040
e-mail: tim.hut...@usask.ca
web: http://www.usask.ca/archives/

MJ Suhonos

Feb 25, 2011, 2:43:33 PM
to qubit-dev
Hi all,

I wrote the XML import module, including this particular line of code, so I can at least explain the rationale of this (intended) behaviour — whether it's something we want to keep or not is a separate issue. :-)

Normalizing or preserving whitespace is typically an option on most XML parsers, based on the W3C spec (good summary at: http://www.usingxml.com/Basics/XmlSpace). The behaviour implemented here is "replace" (actually it's incomplete as it should also replace \t characters), but what I think is expected by Jesús below is "collapse" (or preserve).
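
As a rough sketch of the three modes, following the whiteSpace definitions in XML Schema Part 2 (preserve, replace, collapse): the function name normalizeWhitespace is invented for this example and is not part of Qubit.

```php
<?php
// Hypothetical helper (not Qubit code) sketching the three whiteSpace modes.
function normalizeWhitespace(string $text, string $mode): string
{
    switch ($mode) {
        case 'preserve':
            // Leave the text untouched.
            return $text;
        case 'replace':
            // Each tab, line feed and carriage return becomes one space;
            // runs of spaces are NOT merged.
            return strtr($text, "\t\n\r", "   ");
        case 'collapse':
            // After "replace", merge runs of spaces and trim both ends.
            $replaced = strtr($text, "\t\n\r", "   ");
            return trim(preg_replace('/ +/', ' ', $replaced), ' ');
        default:
            throw new InvalidArgumentException("Unknown mode: $mode");
    }
}

echo '[', normalizeWhitespace(" a\t\nb ", 'replace'), ']', PHP_EOL;
echo '[', normalizeWhitespace(" a\t\nb ", 'collapse'), ']', PHP_EOL;
```

Going by these definitions, a pattern with a + quantifier combined with trim() behaves like "collapse", since it merges runs of whitespace and strips both ends.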

The nodeValue populated on line 349 was intended to be a convenience variable for elements where the value is expected to be simple, ie. single-line, no child elements (hence the NB comment).

The nodeXML variable contains the full, un-normalized value of the element, including all whitespace characters. Note that the choice of which value to pass through the import is determined by the YAML mapping in /apps/qubit/modules/object/config/import — in this case, ead1.yml.

If the Parameters value is not specified in a mapping, it defaults to the value of nodeValue (which is likely the case here), but it can be specified explicitly as nodeValue, nodeXML, or any of the other populated variables (importDOM and domNode2 are used in several places as well).

E.g., assuming the value being populated is "scope and content", then to preserve whitespace and HTML elements like <p>, you would insert the following in ead1.yml after lines 120 and 124:

Parameters: [$nodeXML]

More detail on import mappings is available at: http://qubit-toolkit.org/wiki/index.php?title=XML_import/export#Import
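
The difference between the two variables described above can be sketched with PHP's DOM extension. This is only an illustration: the XML fragment is invented, and the way $nodeXML is built here (serializing the child nodes) is an assumption, not necessarily how QubitXmlImport.class.php populates it.

```php
<?php
// Hypothetical EAD fragment; the content is invented for illustration.
$xml = "<scopecontent><p>First paragraph.</p>\n<p>Second paragraph.</p></scopecontent>";

$doc = new DOMDocument();
$doc->loadXML($xml);
$domNode2 = $doc->documentElement;

// Text only: child tags such as <p> do not appear in nodeValue, and the
// normalization then flattens the remaining line break into a space.
$nodeValue = trim(preg_replace('/[\n\r\s]+/', ' ', $domNode2->nodeValue));

// Markup kept: serialize each child node (assumed approach, see above).
$nodeXML = '';
foreach ($domNode2->childNodes as $child) {
    $nodeXML .= $doc->saveXML($child);
}

echo $nodeValue, PHP_EOL;
echo $nodeXML, PHP_EOL;
```

Here $nodeValue comes out as a single line of plain text, while $nodeXML still carries the <p> elements and the line break between them.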

Peter and I have discussed the merits of this import design and its implications for the Qubit data model. I think we both agree that it is a bit cumbersome, and there are other potential ways we could implement this, but they would require a fairly significant amount of effort. I am hopeful, however, that part of our LAC engagement (and its heavy use of MARC as an import format) may add some new insight as to what some of the better options might be.

Hope this helps,
MJ

David Juhasz

Feb 25, 2011, 3:09:18 PM
to qubi...@googlegroups.com
On 11-02-25 11:43 AM, MJ Suhonos wrote:
> Normalizing or preserving whitespace is typically an option on most XML parsers, based on the W3C spec (good summary at: http://www.usingxml.com/Basics/XmlSpace). The behaviour implemented here is "replace" (actually it's incomplete as it should also replace \t characters), but what I think is expected by Jesús below is "collapse" (or preserve).
Just a quick note:

$nodeValue = trim(preg_replace('/[\n\r\s]+/', ' ', $domNode2->nodeValue));


This should already replace tab characters as well, since "\s" matches all whitespace chars:
http://www.php.net/manual/en/regexp.reference.escape.php
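
A quick check along these lines, with made-up input, shows tabs being replaced by the existing pattern:

```php
<?php
// Invented sample containing single and doubled tab characters.
$text = "one\ttwo\t\tthree";

// Same pattern as the import line; \s in PCRE covers tabs as well as
// spaces, \n and \r.
$result = trim(preg_replace('/[\n\r\s]+/', ' ', $text));

echo $result, PHP_EOL;

// preg_match() returns 1 when the pattern matches.
$tabMatches = preg_match('/\s/', "\t");
echo $tabMatches, PHP_EOL;
```

Since \s already includes \n and \r, the character class could presumably be shortened to '/\s+/' without changing the result.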

David Juhasz

Feb 25, 2011, 3:17:28 PM
to qubi...@googlegroups.com
And my *first* comment should have been:

Thanks for the insightful follow-up post MJ. I certainly wasn't aware of
the conventions regarding preserving and collapsing whitespace in XML
parsing, so it was a very helpful explanation to me.

MJ Suhonos

Feb 25, 2011, 3:20:22 PM
to qubi...@googlegroups.com
>> Normalizing or preserving whitespace is typically an option on most XML parsers, based on the W3C spec (good summary at: http://www.usingxml.com/Basics/XmlSpace). The behaviour implemented here is "replace" (actually it's incomplete as it should also replace \t characters), but what I think is expected by Jesús below is "collapse" (or preserve).
> Just a quick note:
>
> $nodeValue = trim(preg_replace('/[\n\r\s]+/', ' ', $domNode2->nodeValue));
>
> Should replace tab characters as well, as "\s" matches all whitespace chars:
> http://www.php.net/manual/en/regexp.reference.escape.php

So it does. Guess I did it right after all. :-)

MJ

David at Artefactual

Feb 25, 2011, 3:29:43 PM
to Qubit Toolkit Developers
Hi Tim,

Just to be clear, the formatting issue that Jessica reported only
occurs when importing an EAD XML file - I'm not sure if your in-house
patch addresses that particular use-case, or if it's a more general
patch for displaying HTML <p> tags? In either case, your patch
sounds like a good candidate for applying to the Qubit trunk. I'm
very certain that Wu's patch for issue 1132 had nothing to do with
this formatting issue - MJ's post explains the rationale behind the
current import code, and it was implemented long before Wu's
patch. :)

Our preferred method for submitting the patch would be to attach a
patch file to the issue report (which Jessica will file next week);
if your particular patch is not specific to EAD import, then creating
a new issue report for the original problem and attaching your
suggested patch would be most welcome. :)

Cheers,
David

Tim Hutchinson

Feb 25, 2011, 3:53:12 PM
to qubi...@googlegroups.com
Hi David,

Yes, we'll definitely follow the correct process for patches :) As I
mentioned, having a look at potential patches (and identifying other
issues for that matter) based on my work on the import is on my to-do
list - but it's lower priority right now than the other development as
well as actually getting the migration done. At this point I can at
least comment on the issue report, when Jessica files it, to flag that
we may be able to contribute.

Our customization does relate to EAD import. Basically, the
customization strips out the <p>'s and replaces the </p>'s with \n\n,
along with the other normalizations, so that separate paragraphs in the
source XML file are retained when they're imported into ica-atom - and
consistent with the way they would be formatted had they been entered
directly, as opposed to retaining the <p>'s.
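
A minimal sketch of that kind of transformation, for anyone following along. This is a guess at the general approach and not Wu's actual patch; the function name paragraphsToText and the sample input are invented.

```php
<?php
// Hypothetical helper: turn <p>...</p> markup into blank-line-separated text.
function paragraphsToText(string $xml): string
{
    // Replace each closing </p> with a blank line, then drop opening <p> tags.
    // The pattern requires whitespace or ">" after "p", so other elements
    // whose names start with "p" are left alone.
    $text = preg_replace('#</p\s*>#i', "\n\n", $xml);
    $text = preg_replace('#<p(\s[^>]*)?>#i', '', $text);

    // Split on the paragraph breaks, normalize whitespace inside each
    // paragraph, and discard anything that ends up empty.
    $paragraphs = preg_split('/\n{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $paragraphs = array_map(function ($p) {
        return trim(preg_replace('/\s+/', ' ', $p));
    }, $paragraphs);
    $paragraphs = array_filter($paragraphs, function ($p) {
        return $p !== '';
    });

    return implode("\n\n", $paragraphs);
}

echo paragraphsToText("<p>First\n  paragraph.</p>\n<p>Second paragraph.</p>"), PHP_EOL;
```

One limitation of this sketch: a blank line inside a single <p> element would also be treated as a paragraph break.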

There are definitely some related (and more complicated) issues, which
at this point we've worked around by modifying the files exported from
the legacy system. One example that comes to mind is child elements of
<physdesc> - <extent>, <dimensions>, etc. Right now the EAD files need to
be relatively non-granular; another example (known issue) is repeated
elements which result in lost data.

Tim

Jessica Bushey

Mar 1, 2011, 3:26:06 PM
to Qubit Toolkit Developers
I have updated existing Issue 719 to reflect this thread.
Jessica Bushey