best way to index xml files


vijay kumar musham

Nov 15, 2012, 11:10:27 AM
to elastica-...@googlegroups.com
Hello Everyone,

I really appreciate Ruflin's great work on Elastica. I am Vijay, and I work on data management projects. I started working with Elastica recently and want to implement Elasticsearch in our web application. I have installed Elastica, run the tests, and written a few examples myself, and everything works well. I would like to know the best way to index XML files, since I have a large number of them. A sample file looks like this:

<document>
<doc_id>...</doc_id>
<title>...</title>
<display_id>...</display_id>
<author label ='xxx' >...</author>
<author>...</author>
<genre authority='xxx'>...</genre>
<genre>...</genre>
....
</document>
<document>
....
</document>

I have these XML files stored in the file system as well as in a MySQL database.
1) What is the best way to index XML files: reading from the database or from the file system? Can anyone provide examples? I have read "Lucene in Action", which shows how to index an XML document by parsing it, reading each tag, and creating a field for it. Should I use the same approach, or is there a way to convert the XML to JSON and index that?
2) I am trying to define my mapping like this.
id => string
display_id => string
title => string
author => object ( array {  object ( label => string, text => string ) } ) 
Since there can be one or more authors, does it work that way?

3) My web application displays these documents so that they can be added, created, or edited, so the XML files will change. Once a document is modified, do I need to remove it from the index, create a new document, and add that to the index?

Here is a sample document:

<document>
<document_id>mnwp000002</document_id>
<record_create_date>04-30-2004</record_create_date>
<indexing_data_id>mnwp</indexing_data_id>
<item_title>Mrs. George Elder Adams, of New York, who took part in the picketing of the White House by members of the Woman&apos;s Party.</item_title>
<author_creator label="pht">Edmonston, Washington, D.C.</author_creator>
<source_collection>Records of the National Woman&apos;s Party</source_collection>
<collection_id>ammem/mnwp</collection_id>
<physical_locator_id label="Location">National Woman&apos;s Party Records, Group I, Container I:147, Folder: Adams, Mrs. George E.</physical_locator_id>
<document_type>still image</document_type>
<genre authority="bgtchm">Photographs</genre>
<medium>1 photograph: print; 4 x 6 in.</medium>
<text_date>[ca. 1917-1920]</text_date>
<language_of_cataloging>eng</language_of_cataloging>
<digital_origin>reformatted digital</digital_origin>
<subject label="lcsh">National Woman&apos;s Party</subject>
<subject label="lcsh">Suffragists--United States--1910-1920</subject>
<subject label="lcsh">The Suffragist (serial)</subject>
<subject label="lcsh">Women--Suffrage--New York (State)</subject>
<subject label="local">Adams, Mrs. George Elder</subject>
<geog_subject>
  <country>United States</country>
    <state>New York</state>
</geog_subject>
<note label="Summary">Studio portrait, Mrs. George Elder Adams of New York, in hat and fur stole, standing with a copy of the newsletter The Suffragist in her hands.</note>
<note>Title transcribed from item.</note>
<digital_object fileGrp_ptr="GRP001">
 <do_reference>
   <do_digital_id>147001</do_digital_id>
   <do_aggregate>mnwp</do_aggregate>
 </do_reference>
  <do_handle_information>hdl:loc.mss/mnwp.147001</do_handle_information>
  <do_display_type_id>p</do_display_type_id>
</digital_object>
<restriction_description>No known restrictions on use or reproduction.</restriction_description>
 <division_id>mss</division_id>
<date_sorter>19170000</date_sorter>
<fileSec>
<fileGrp ID="GRP001">
<file ID="FILE001" MIMETYPE="image/tiff" CREATED="2004-12-22" USE="master" SEQ="1" SIZE="5198214">/master/mss/mnwp/147/147001u.tif</file>
<file ID="FILE002" MIMETYPE="image/gif" CREATED="2005-06-29" USE="thumbnail" SEQ="1" SIZE="7980">/service/mss/mnwp/147/147001t.gif</file>
<file ID="FILE003" MIMETYPE="image/jpeg" CREATED="2005-06-29" USE="service-high" SEQ="1" SIZE="657830">/service/mss/mnwp/147/147001v.jpg</file>
<file ID="FILE004" MIMETYPE="image/jpeg" CREATED="2005-06-29" USE="service-low" SEQ="1" SIZE="27897">/service/mss/mnwp/147/147001r.jpg</file>
</fileGrp>
</fileSec>
</document>

Please provide any samples or documentation that might help.

I really appreciate any kind of help!
 - Vijay.

ruflin

Nov 19, 2012, 7:53:04 AM
to elastica-...@googlegroups.com
Hi Vijay

Since you ask quite a few questions, I will try to answer them here.

Read in data:
Since you have your data both in XML files and in the database, either source would work. I assume the main problem is the conversion. There are XML processors for PHP, so you could read each file into an associative array in PHP and then convert it to JSON, or read the data directly from the db. I would assume reading directly from the db is less of a hassle and faster, but it should not matter much in the end.
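
A minimal sketch of the "XML to associative array to JSON" route mentioned above, using PHP's built-in SimpleXML extension. The helper name xmlToArray, the '@' prefix for attributes, and the 'text' key are assumptions made for this illustration (the '@' prefix mirrors field names such as "@label" that appear later in this thread); they are not part of Elastica.

```php
<?php
// Hypothetical helper: convert one XML element into a PHP value.
// - attributes become '@name' keys
// - repeated child elements (e.g. several <subject> tags) become a list
// - a leaf without attributes becomes a plain string
function xmlToArray(SimpleXMLElement $node)
{
    $result = array();

    foreach ($node->attributes() as $name => $value) {
        $result['@' . $name] = (string) $value;
    }

    $hasChildren = false;
    foreach ($node->children() as $child) {
        $hasChildren = true;
        $name  = $child->getName();
        $value = xmlToArray($child);

        if (!array_key_exists($name, $result)) {
            // First occurrence of this tag
            $result[$name] = $value;
        } elseif (is_array($result[$name]) && isset($result[$name][0])) {
            // Already a list: append
            $result[$name][] = $value;
        } else {
            // Second occurrence: turn the single value into a list
            $result[$name] = array($result[$name], $value);
        }
    }

    if ($hasChildren) {
        return $result;
    }

    $text = trim((string) $node);
    if (count($result) === 0) {
        return $text;
    }
    $result['text'] = $text;
    return $result;
}

// A shortened version of the sample document from the question
$xml = '<document>'
     . '<document_id>mnwp000002</document_id>'
     . '<author_creator label="pht">Edmonston, Washington, D.C.</author_creator>'
     . '<subject label="lcsh">The Suffragist (serial)</subject>'
     . '<subject label="local">Adams, Mrs. George Elder</subject>'
     . '</document>';

$document = xmlToArray(simplexml_load_string($xml));
echo json_encode($document), "\n";
```

One caveat with this generic approach: a tag that sometimes carries attributes and sometimes does not (such as <note> in the sample) would come out as an object in one case and a plain string in the other. For a fixed mapping it is probably safer to normalise such fields to a single shape before indexing.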

Mapping:
For mapping an array take a closer look here:
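
Independent of the linked page, here is a rough sketch of the mapping proposed in the question, written as a plain PHP array. The field names and the "string" type are taken from the question; how the array is handed to Elastica is left out on purpose. In Elasticsearch any field can hold one or several values, so no separate array type should be needed: "author" is declared once as an object and a document may then carry a single author or a list of them.

```php
<?php
// Properties as proposed in the question (id, display_id, title, author).
// 'author' is declared as an object with two string sub-fields; a document
// can supply one such object or a list of them under the same mapping.
$properties = array(
    'id'         => array('type' => 'string'),
    'display_id' => array('type' => 'string'),
    'title'      => array('type' => 'string'),
    'author'     => array(
        'type'       => 'object',
        'properties' => array(
            'label' => array('type' => 'string'),
            'text'  => array('type' => 'string'),
        ),
    ),
);

// Example document body with two authors (values are placeholders)
$document = array(
    'id'     => 'mnwp000002',
    'author' => array(
        array('label' => 'pht', 'text' => 'Edmonston, Washington, D.C.'),
        array('label' => 'xxx', 'text' => 'Second author'),
    ),
);

echo json_encode(array('properties' => $properties)), "\n";
echo json_encode($document), "\n";
```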

Updating / Replacing:
If your documents change and you send a document with the same id to the Elasticsearch server, it will automatically overwrite the existing document. There is also an option to update parts of documents, but I assume this is not what you need.

I hope this helps.

Nicolas

vijay kumar musham

Nov 19, 2012, 1:30:36 PM
to elastica-...@googlegroups.com
Hello Ruflin,

Thanks a lot for clarifying my doubts. I will work on it using your inputs.

Thank you,
Vijay.

vijay kumar musham

Dec 14, 2012, 11:07:59 AM
to elastica-...@googlegroups.com
Hello Ruflin,

I was able to deploy Elasticsearch in my web application, and everything works fine. I created an index with a mapping and added 15,800 XML documents to it. I wrote a PHP script that gets the index and creates the query object using Elastica_Query_Builder. The query string (the words to search for) is sent from the web browser (another PHP file with a text input box).

// Fields to search, as defined in the index mapping.
$fields = array(
    "author_creator_text", "genre_text", "collection_id", "credit_line", "date_sorter", "@fileGrp_ptr", "do_display_type_id", "do_handle_information", "do_aggregate", "do_digital_id",
    "digital_origin", "division_id", "document_id", "document_type", "city", "county", "state", "country", "geog_subject_unparsed", "indexing_data_id",
    "item_title", "language", "language_of_cataloging", "medium", "note", "other_repository", "physical_locator", "physical_locator_id", "publication_date", "publication_location",
    "publisher", "record_create_date", "ro_caption", "ro_display_type_id", "ro_document_id", "ro_handle_information", "ro_web_id", "related_name", "reproduction_no", "restriction_description",
    "source_collection", "status", "subject", "text_date", "@label", "@SIZE"
);

// Build the query as an array and let json_encode() escape the user input,
// so quotes or backslashes in $_POST['value'] cannot break the JSON.
$queryString = json_encode(array(
    "query_string" => array(
        "query" => $_POST['value'],
        "fields" => $fields,
        "use_dis_max" => true
    )
));
        $query = new Elastica_Query_Builder($queryString);

I went through the documentation at this site, 

Everything works fine and I am getting search results back. For example, if I enter "lomax" in the text field, it searches all of the fields listed above in every document and returns the documents found. If I enter "lomax and interviews", it also returns results. That's perfect.

My question is about query strings like "author_creator:lomax and genre_text:interviews", where I want to search for "lomax" in "author_creator" and "interviews" in "genre_text". Basically, instead of listing all fields in the script, is there any way I can send the fields along with the query string and then parse it in the script before constructing the query object?

Thanks and regards,
Vijay.

ruflin

Dec 15, 2012, 7:18:45 AM
to elastica-...@googlegroups.com
I don't know the search builder from Clint in detail, but query_string is one of the rawest ways to create a query. There are many more specific query types available, and I would recommend chaining them to get the result you are looking for. You are probably looking for Elastica_Query_Term, where you can set the field for each query, and Elastica_Query_Bool to chain your queries:
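
A minimal sketch of that idea, parsing input such as "author_creator:lomax and genre_text:interviews" into a bool query with one term clause per field. It builds the plain array form of the Elasticsearch query rather than using the Elastica classes, so it runs without the library. The helper name parseFieldQuery is hypothetical, and lowercasing the values is an assumption that the fields are analyzed with the default analyzer (term queries are not analyzed, so the value has to match the indexed term exactly).

```php
<?php
// Hypothetical helper: turn "field:value and field:value" into
// array('bool' => array('must' => array(term clauses...))).
function parseFieldQuery($input)
{
    $must = array();

    // Split on the word "and" (case-insensitive) surrounded by whitespace
    foreach (preg_split('/\s+and\s+/i', trim($input)) as $part) {
        // Split on the first colon only, so values may contain colons
        $pieces = explode(':', $part, 2);
        if (count($pieces) !== 2) {
            continue; // no "field:" prefix; ignored in this sketch
        }

        $field = trim($pieces[0]);
        // Assumption: fields use the default analyzer, which lowercases terms
        $value = strtolower(trim($pieces[1]));
        if ($field === '' || $value === '') {
            continue;
        }

        $must[] = array('term' => array($field => $value));
    }

    return array('bool' => array('must' => $must));
}

$query = parseFieldQuery('author_creator:lomax and genre_text:interviews');
echo json_encode($query), "\n";
```

Since the field names come from user input here, it would be sensible to check them against the known list of fields before building the query.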
