Hi Jessica,
Good catch, I think you are right. I did not write the import/export module and I don't know this field either, but in case David thinks this is really an issue, I think I know where the problem is:
David, in QubitXmlImport.class.php we replace \n, \r and \s with a space, at lines 349/350:
// normalize the node text (trim whitespace manually); NB: this will strip any child elements, eg. HTML tags
$nodeValue = trim(preg_replace('/[\n\r\s]+/', ' ', $domNode2->nodeValue));
So this is a problem with the import; the export works as expected. I hope this information helps in writing a detailed issue and in knowing where to look when we start working on it :-).
Thank you Jessica!
Regards,
2011/2/24 Jessica Bushey <jes...@artefactual.com>
<snip>
But I noticed that when I imported the townley.ead.xml file that David supplied into the USask test site, the data didn't format properly on the "View archival description" page. Perhaps it is simply that ead.xml file? I imported the townley.ead.xml file into Qubit Trunk and the ICA-AtoM Demo site, and it was unformatted in all of them. The only reason I know it wasn't formatting properly is that I took a screengrab of the townley fonds "View archival description" page in ICA-AtoM before deleting the file and importing the townley.ead.xml file that David supplied. I have attached the screengrabs for you, but this might be a total non-issue, so please ignore if so.
--
Jesús García Crespo
-- David Juhasz, Software Engineer Artefactual Systems Inc. www.artefactual.com
I may be missing something here but Wu has a customization to deal with
properly importing separate paragraphs. One of the things on my task
list is to get a handle on which of our customizations developed in the
course of figuring out our migration (not yet complete) would make sense
to submit as potential patches, but that would be one of them. At the
moment I believe it's implemented on saindev.usask.ca/ica-atom-fonds but
I'm not sure about the other instances.
So, if there are other formatting issues on our site, possibly a
side-effect of this customization (Jessica mentioned testing on the
USask site), that would be worth knowing about, but if it's a matter of
preserving line breaks as encoded through <p>...</p>, then I think we
have a potential fix.
Tim
On 2/25/2011 1:04 PM, David at Artefactual wrote:
> [Fix subject]
>
> On Feb 25, 11:02 am, David Juhasz<da...@artefactual.com> wrote:
>> Hi,
>>
>> This is a valid issue as far as I'm concerned, so I've forwarded this
>> conversation to the qubit-dev list.
>>
>> Jessica, can you please file an issue for the problem to go with our
>> list of other EAD issues: http://code.google.com/p/qubit-toolkit/issues/list?can=2&q=Component%...
>>
>> Please include the import file and your excellent screenshots with the
>> issue report.
>>
>> Hopefully we can get some time/budget to address all of the outstanding
>> EAD issues in the not-too-distant future.
>>
>> Cheers,
>> David
>>
>>> Jesús García Crespo
>> --
>> David Juhasz,
>> Software Engineer
>>
>> Artefactual Systems Inc. www.artefactual.com
--
Tim Hutchinson
University of Saskatchewan Archives
301 Main Library, 3 Campus Drive
Saskatoon, SK S7N 5A4
tel: (306) 966-6028
fax: (306) 966-6040
e-mail: tim.hut...@usask.ca
web: http://www.usask.ca/archives/
I wrote the XML import module, including this particular line of code, so I can at least explain the rationale of this (intended) behaviour — whether it's something we want to keep or not is a separate issue. :-)
Normalizing or preserving whitespace is typically an option on most XML parsers, based on the W3C spec (good summary at: http://www.usingxml.com/Basics/XmlSpace). The behaviour implemented here is "replace" (actually it's incomplete as it should also replace \t characters), but what I think is expected by Jesús below is "collapse" (or preserve).
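To make the three terms concrete, here is a minimal sketch of what each mode does to a string. It assumes the terms are meant in the sense of the XML Schema whiteSpace facet (preserve, replace, collapse); the function name normalizeWhitespace and the sample string are hypothetical and not part of Qubit.

```php
<?php
// Hypothetical helper (not part of Qubit): applies one of the three
// whitespace modes, assuming the XML Schema whiteSpace facet definitions.
function normalizeWhitespace(string $text, string $mode): string
{
    // Tab, line feed and carriage return each map to a single space.
    $map = ["\t" => ' ', "\n" => ' ', "\r" => ' '];

    switch ($mode) {
        case 'preserve':
            // Leave the text exactly as it appeared in the source.
            return $text;
        case 'replace':
            // One-for-one substitution; runs of spaces are kept.
            return strtr($text, $map);
        case 'collapse':
            // After "replace", squeeze runs of spaces and trim both ends.
            return trim(preg_replace('/ +/', ' ', strtr($text, $map)), ' ');
        default:
            throw new InvalidArgumentException("Unknown mode: $mode");
    }
}

$sample = "  First line\r\n\tSecond   line  ";
echo normalizeWhitespace($sample, 'collapse'), PHP_EOL;
```

With this sample, "replace" keeps the run of spaces between the two lines, while "collapse" reduces it to a single space and drops the leading and trailing whitespace.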
The nodeValue populated on line 349 was intended to be a convenience variable for elements where the value is expected to be simple, ie. single-line, no child elements (hence the NB comment).
The nodeXML variable contains the full, un-normalized value of the element, including all whitespace characters. Note that the choice of which value to pass through the import is determined by the YAML mapping in /apps/qubit/modules/object/config/import — in this case, ead1.yml.
If the Parameters value is not specified in a mapping, it defaults to the value of nodeValue (which is likely the case here), but it can be specified explicitly as nodeValue, nodeXML, or any of the other populated variables (importDOM and domNode2 are used in several places as well).
eg. assuming the value being populated is "scope and content", then to preserve whitespace and HTML elements like <p> in ead1.yml you would insert after lines 120 and 124:
Parameters: [$nodeXML]
More detail on import mappings is available at: http://qubit-toolkit.org/wiki/index.php?title=XML_import/export#Import
Peter and I have discussed the merits of this import design and the implications upon the Qubit data model. I think we both agree that it is a bit cumbersome, and there are other potential ways we could implement this, but they would require a fairly significant amount of effort. I am hopeful, however, that part of our LAC engagement (and its heavy use of MARC as import format) may add some new insight as to what some of the better options might be.
Hope this helps,
MJ
> --
> You received this message because you are subscribed to the Google Groups "Qubit Toolkit Developers" group.
> To post to this group, send email to qubi...@googlegroups.com.
> To unsubscribe from this group, send email to qubit-dev+...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/qubit-dev?hl=en.
>
$nodeValue = trim(preg_replace('/[\n\r\s]+/', ' ', $domNode2->nodeValue));
This should already replace tab characters as well, since "\s" matches all whitespace characters:
http://www.php.net/manual/en/regexp.reference.escape.php
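A quick way to check this is to run the pattern from the import code next to a plain \s+ pattern on a string that contains tabs. This is only an illustrative sketch; the variable names and the sample string are made up for the test.

```php
<?php
// The character class from the import code and a shorter equivalent.
$original   = '/[\n\r\s]+/';
$simplified = '/\s+/';

// Sample text with tabs, CRLF, repeated newlines and padding.
$raw = "  Scope\tand\r\ncontent \n\n note\t ";

$a = trim(preg_replace($original, ' ', $raw));
$b = trim(preg_replace($simplified, ' ', $raw));

var_dump($a === $b); // true: tabs are already covered by \s
echo $a, PHP_EOL;    // Scope and content note
```

Since \s already includes \n and \r, listing them explicitly in the character class is harmless but redundant.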
Thanks for the insightful follow-up post MJ. I certainly wasn't aware of
the conventions regarding preserving and collapsing whitespace in XML
parsing, so it was a very helpful explanation to me.
So it does. Guess I did it right after all. :-)
MJ
Yes, we'll definitely follow the correct process for patches :) As I
mentioned, having a look at potential patches (and identifying other
issues for that matter) based on my work on the import is on my to-do
list - but it's lower priority right now than the other development as
well as actually getting the migration done. At this point I can at
least comment on the issue report, when Jessica files it, to flag that
we may be able to contribute.
Our customization does relate to EAD import. Basically, the
customization strips out the <p>'s and replaces the </p>'s with \n\n,
along with the other normalizations, so that separate paragraphs in the
source XML file are retained when they're imported into ica-atom - and
consistent with the way they would be formatted had they been entered
directly, as opposed to retaining the <p>'s.
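As a rough illustration of the idea (not the actual USask customization, which is not shown in this thread), the paragraph handling could be sketched as below. The function name paragraphsToText is hypothetical, and the sketch joins paragraphs with \n\n rather than appending \n\n after each one, so the result has no trailing blank line.

```php
<?php
// Hypothetical helper (not the actual USask patch): turn <p>...</p> markup
// into blank-line-separated paragraphs, normalizing whitespace in each one.
function paragraphsToText(string $nodeXml): string
{
    $paragraphs = [];

    // Split on closing </p> tags; each piece holds one paragraph's content.
    foreach (preg_split('/<\/p\s*>/i', $nodeXml) as $piece) {
        // Strip the opening <p> tag (with or without attributes).
        $piece = preg_replace('/<p(\s[^>]*)?>/i', '', $piece);
        // Same normalization as the import code: collapse whitespace, trim.
        $piece = trim(preg_replace('/\s+/', ' ', $piece));
        if ($piece !== '') {
            $paragraphs[] = $piece;
        }
    }

    // Separate paragraphs with a blank line, as if typed in directly.
    return implode("\n\n", $paragraphs);
}

$xml = "<p>First\n   paragraph.</p>\n<p>Second\tparagraph.</p>";
echo paragraphsToText($xml), PHP_EOL;
```

Normalizing each paragraph separately matters here: collapsing whitespace over the whole string after inserting \n\n would remove the paragraph breaks again.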
There are definitely some related (and more complicated) issues, which
at this point we've worked around by modifying the files exported from
the legacy system. One example that comes to mind is child elements of
<physdesc> - <extent>, <dimensions>, etc. Right now the EAD files need to
be relatively non-granular; another example (known issue) is repeated
elements which result in lost data.
Tim