[Dspace-tech] Duplicate metadata fields on export

41 views
Skip to first unread message

Isaak, David C

unread,
Aug 26, 2015, 9:33:55 AM8/26/15
to dspac...@lists.sourceforge.net
Hello,
 
On a recent export of my DSpace metadata to CSV, I encountered a problem.
 
I upload all metadata through the CLI using the Simple Archive Format. Typically, when I export my metadata to a CSV, my metadata fields have a language code in brackets like this:
id,collection,dc.contributor.author[en_US],dc.title[en_US],dc.identifier.uri
 
On my most recent export, I now have two additional metadata columns. They are the same as two existing columns (dc.description.abstract[en_US] and dc.identifier.issn[en_US]), but without the language qualifier*. The two new columns are completely blank (according to Excel and Notepad++). I am not sure how they appeared or how to get rid of them. Any suggestions would be appreciated.
 
*The new columns appear as dc.description.abstract not dc.description.abstract[], which is the case when I enter metadata through the graphical user interface without a language code. This type of column repetition can be fixed by consolidating the columns and using Batch Metadata Editing. In my current situation though, there is no data in dc.description.abstract, so the Batch Metadata Editor does not detect any changes.
 
Sincerely,
David
---------
David Isaak
Digital Projects Librarian
Kaiser Permanente
Center for Health Research
3800 N. Interstate Ave.
Portland, OR 97227
(503) 335-2437
---------
kpchr.org
 
 
 

helix84

unread,
Aug 26, 2015, 9:33:58 AM8/26/15
to Isaak, David C, dspac...@lists.sourceforge.net
Hi David,

what you're describing is a pretty common situation. You'll get the
"xyz[]" column if you add a metadata value with an empty string for
language (the "text_lang" column of the "metadatavalue" table). You're
also completely correct about the way to resolve it using Batch
Metadata Editing (BME).

I also encountered the "xyz" column, although I don't exactly remember
how I got it. IIRC, you can see it when the "text_lang" column has a
NULL in that row.

It seems strange that you would get such column in BME if there were
no values in it. I can think of these options:
a) is that an export of your whole repository or only a part of it
(e.g. collection)? If it's only a part, maybe the null is in an item
that is not part of your current export.
b) if BME doesn't detect any changes, do it in two turns: first, put a
bogus value anywhere in that column and import it; second, re-export,
remove the bogus value and import it.

You can also take a look at the database to understand what's going
on. Try these queries:
SELECT text_lang FROM metadatavalue GROUP BY text_lang;
SELECT count(*) FROM metadatavalue WHERE text_lang IS NULL;
SELECT count(*) FROM metadatavalue WHERE text_lang = 'en_US';
SELECT count(*) FROM metadatavalue WHERE text_lang = '';

We'll be glad if you report back with your findings.

Regards,
~~helix84

Isaak, David C

unread,
Aug 26, 2015, 9:34:03 AM8/26/15
to hel...@centrum.sk, dspac...@lists.sourceforge.net
Hi Helix,

Thank you for your response. Answers to your questions are below:

It seems strange that you would get such column in BME if there were no values in it. I can think of these options:
a) is that an export of your whole repository or only a part of it (e.g. collection)? If it's only a part, maybe the null is in an item that is not part of your current export.

At first, I exported the whole repository, which is when I found the "xyz" column. I then went through and exported each collection individually and found that the "xyz" column is coming from only one collection.

b) if BME doesn't detect any changes, do it in two turns: first, put a bogus value anywhere in that column and import it; second, re-export, remove the bogus value and import it.

I am pretty sure the problem was a result of an upload I did last Friday, so I tried the bogus value import-then-remove method on just those 80 items, but it did not remove the "xyz" column. The collection has about 3000 items in it, so I cannot use the UI BME on the whole collection, but if I use Item Update through the CLI interface, won't the [en_US] be added automatically?

I ran the queries you suggested, but I am new to looking at the database itself, so I am not sure how to interpret these results (I currently have 8712 items in the repository):

SELECT text_lang FROM metadatavalue GROUP BY text_lang;
""
""
"en"
"en_US"

SELECT count(*) FROM metadatavalue WHERE text_lang IS NULL;
25569

SELECT count(*) FROM metadatavalue WHERE text_lang = 'en_US';
342570

SELECT count(*) FROM metadatavalue WHERE text_lang = '';
574

Are these numbers counting metadata fields that do not usually have language codes in text_lang such as dc.identifier.uri, dc.date.accessioned, and dc.date.available?

David

helix84

unread,
Aug 26, 2015, 9:34:06 AM8/26/15
to Isaak, David C, dspac...@lists.sourceforge.net
On Mon, Jun 25, 2012 at 8:42 PM, Isaak, David C <David....@kpchr.org> wrote:
> The collection has about 3000 items in it, so I cannot use the UI BME on the whole collection,

You can raise the limit (for UI) in [dspace]/config/modules/bulkedit.cfg
But I would recommend you to use the command line version of BME
([dspace]/bin/dspace metadata-import) to which this limitation doesn't
apply.

> but if I use Item Update through the CLI interface, won't the [en_US] be added automatically?

I don't know, I don't use that tool much.

> SELECT text_lang FROM metadatavalue GROUP BY text_lang;
> ""
> ""
> "en"
> "en_US"

This looks a bit weird because it shows the empty string twice, but
judging from the following query I assume one of them is only how your
SQL client displays NULL.

> SELECT count(*) FROM metadatavalue WHERE text_lang IS NULL;
> 25569

> SELECT count(*) FROM metadatavalue WHERE text_lang = '';
> 574

So you have both types. The latter is probably the "xyz" case.

> Are these numbers counting metadata fields that do not usually have language codes in text_lang such as dc.identifier.uri, dc.date.accessioned, and dc.date.available?

That's correct. Here's how you would find only the values of a certain
field (here dc.identifier.uri):

SELECT * FROM metadatavalue,metadatafieldregistry WHERE
metadatavalue.metadata_field_id =
metadatafieldregistry.metadata_field_id AND element = 'identifier' AND
qualifier = 'uri' AND text_lang = '';

Regards,
~~helix84

Isaak, David C

unread,
Aug 26, 2015, 9:34:07 AM8/26/15
to hel...@centrum.sk, dspac...@lists.sourceforge.net
Problem resolved, thanks helix84.

Running the query SELECT * FROM metadatavalue,metadatafieldregistry WHERE metadatavalue.metadata_field_id = metadatafieldregistry.metadata_field_id AND element = 'identifier' AND qualifier = 'uri' AND text_lang = ''; resulted in zero hits, but

Running the query SELECT * FROM metadatavalue,metadatafieldregistry WHERE metadatavalue.metadata_field_id = metadatafieldregistry.metadata_field_id AND element = 'identifier' AND qualifier = 'uri' AND text_lang is NULL; came back with four hits.

I looked up those item numbers in the UI and indeed there were fields where the value of the field AND the text lang were both NULL. I deleted the fields through the UI, re-exported the metadata, and the extra fields were gone.

David

Thornton, Susan M. (LARC-B702)[LITES]

unread,
Aug 26, 2015, 9:34:10 AM8/26/15
to Isaak, David C, hel...@centrum.sk, dspac...@lists.sourceforge.net

Just a helpful hint on how to make your query much quicker and easier to type and likely more efficient (better performance)!  I know we all "have our ways", but it works great and is very easy to read.  Here's how I would've coded the second query:

 

Your SQL:  SELECT * FROM metadatavalue,metadatafieldregistry WHERE metadatavalue.metadata_field_id = metadatafieldregistry.metadata_field_id AND element = 'identifier' AND qualifier = 'uri' AND text_lang is NULL;

 

My SQL:    

 

Option 1:   SELECT *

               FROM metadatavalue mv

                  , metadatafieldregistry mr

               WHERE mv.metadata_field_id = mr.metadata_field_id

                 AND mr.element = ‘identifier’

                 AND mr.qualifier = ‘uri’

                 AND mv.text_lang IS NULL;

 

Option 2: (better)   SELECT <only the columns you need)

                        FROM metadatavalue mv

                           , metadatafieldregistry mr

                        WHERE mv.metadata_field_id = mr.metadata_field_id

                          AND mr.element = ‘identifier’

                          AND mr.qualifier = ‘uri’

                          AND mv.text_lang IS NULL;

 

Option 3: (best)   SELECT <only the columns you need)

                        FROM metadatavalue

                        WHERE metadata_field_id = 25   /* identifier.uri */

                          AND text_lang IS NULL;

 

Best regards,

Sue

 

 

Sue Walker-Thornton

(w):  (757) 864-2368

(m):  (757) 506-9903

------------------------------------------------------------------------------

Live Security Virtual Conference

Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

_______________________________________________

DSpace-tech mailing list

DSpac...@lists.sourceforge.net

https://lists.sourceforge.net/lists/listinfo/dspace-tech

Anders Wändahl

unread,
Dec 16, 2020, 11:10:34 AM12/16/20
to DSpace Technical Support

Bringing up an oooold case. What if omitting the language aspect on metadata completely. Always null in the language field, discarding language qualifiers in OAI-PMH imports or any other way of feeding metadata to Dspace. Is this possible and what are the implications?? If the implications are tolerable, anyone who knows where to change the code.

Anders
KTH Royal Inst of Technology, Sweden
Reply all
Reply to author
Forward
0 new messages