What fields are searched within a dataset (or datafiles??)

Sherry Lake

Jan 15, 2020, 2:22:03 PM
to Dataverse Users Community
The search box on a dataset page is under the "file" tab, which makes me think that file fields are what is searched.

But what are the fields that are searched? 

One reason I am confused may be that I think I found a bug in how search results are displayed, but before I create an issue I want to understand what is searched.



The dataset in question has 356 files; 351 of them are tagged "sib" (and of those, 348 have file names that end in ".sib").

So why, when I search for "sib", do I only see 32 results?

Another search I tried was "xml". There are 5 files tagged "xml"; all of them have file names that end in ".xml". So why do I only get 1 when I search for "xml"?



Thanks for any insights!!

Sherry


Heppler, Michael

Jan 15, 2020, 4:56:12 PM
to dataverse...@googlegroups.com
Hello, Sherry.

After getting some background on this feature from our developer, Leonid, I was able to get some answers for you.

According to getFileIdsInVersionFromSolr in DatasetPage.java, the only fields that the search box searches are FILE_NAME and FILE_DESCRIPTION.

(There are already issues open to add more fields to that search box, such as file tags, "UI to enable file tags to be searched" #4122, or MD5/UNF, "Search: Allow searching by md5, possibly UNF from UI" #3436.)

That would explain why you couldn't get all the files you tagged returned in the results. Now, the reason your searches for file extensions like "sib" and "xml" didn't work as expected is that FILE_NAME is saved in Solr as a string, and it doesn't do a very good job of separating the file extension from the rest of the text string.
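To make the first point concrete, here is a minimal sketch in Python. It is only a toy model, not Dataverse or Solr code: the `FileRecord` class, the `simple_search` function, the whitespace-only tokenizing, and the file names are all made up for illustration. The one thing taken from the explanation above is that only the file name and description are searched, never the tags.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    # Hypothetical stand-in for a file entry; not a Dataverse class.
    name: str
    description: str = ""
    tags: tuple = ()


# Assumption: mirrors FILE_NAME and FILE_DESCRIPTION; tags are deliberately absent.
SEARCHED_FIELDS = ("name", "description")


def simple_search(records, term):
    """Return records where `term` equals a whitespace-separated token
    in one of the searched fields (a simplification of real indexing)."""
    term = term.lower()
    hits = []
    for rec in records:
        tokens = []
        for field_name in SEARCHED_FIELDS:
            tokens.extend(getattr(rec, field_name).lower().split())
        if term in tokens:
            hits.append(rec)
    return hits


files = [
    FileRecord("score01.sib", tags=("sib",)),
    FileRecord("notes.txt", description="sib export notes", tags=("sib",)),
    FileRecord("readme.txt", tags=("sib",)),
]

# All three files are tagged "sib", but only one has "sib" as a
# separate word in its name or description.
print([rec.name for rec in simple_search(files, "sib")])
```

In this toy model, "score01.sib" is a single token, so a search for "sib" misses it, and "readme.txt" is missed because its only connection to "sib" is the tag.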

That said, you can use wildcard search syntax to get the results you expected. In your example dataset, if you search "*sib", using the asterisk or star for the wildcard, you would get your 348 files with the sib file extension.
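As a rough illustration of why the leading asterisk helps, the sketch below uses Python's `fnmatch` module as a stand-in for wildcard matching. This is an approximation only: Solr's real behavior depends on how the field is analyzed, and the `wildcard_search` function and the file names here are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_search(names, pattern):
    """Return names that match `pattern` as a whole string.

    `*` matches any run of characters. Without a wildcard, the
    pattern must equal the entire name.
    """
    pattern = pattern.lower()
    return [n for n in names if fnmatchcase(n.lower(), pattern)]


# Hypothetical file names for illustration.
names = ["score01.sib", "score02.sib", "notes.txt", "sibling.xml"]

print(wildcard_search(names, "sib"))   # no name is exactly "sib"
print(wildcard_search(names, "*sib"))  # names ending in "sib"
print(wildcard_search(names, "*xml"))  # names ending in "xml"
```

Note that "*sib" does not pick up "sibling.xml": the pattern only allows extra characters before "sib", not after it.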

Not sure why you get 1 file when you search "xml", but my guess is that it might have something to do with the parentheses acting as some kind of separator in the file name string. I don't know, I am just guessing.

Hopefully that helps. As always, feel free to open a new issue in GitHub if you believe this feature, or the documentation for it, can and should be improved.

Happy searching!

Mike

--

Michael Heppler
User Interface Designer + Developer, Dataverse Project
Institute for Quantitative Social Science, Harvard University
1737 Cambridge Street, Rm K333, Cambridge, MA 02138
www.iq.harvard.edu

Sherry Lake

Jan 16, 2020, 8:55:45 AM
to dataverse...@googlegroups.com
Yes!! Thanks again Michael!

Searching with *xml in the search box gives me all files with the ".xml" extension. NOW I can tag this whole batch on the dataset page, which I need to do because of a problem tagging on upload (see https://github.com/IQSS/dataverse/issues/2842).

And with more of my files tagged, I can make better use of the "filter by file tag" feature.

--
Sherry
