Lucene index files missing after Hadoop batch indexing

Zhou Lizhi

May 6, 2013, 3:42:12 AM

to sensei...@googlegroups.com

I adapted the MapReduce job in the car demo to read LZO-format input and switched from the old 'mapred' API to the new 'mapreduce' API. The MapReduce job ran fine. I then pushed the index files to SenseiDB, but each record contains only DocID and UID. I found that some term-vector and norms index files (.tvf, .tvx, .tvd, .nrm) were missing, even though I modified only the mapper and reducer, not the Lucene writers. How can I fix this? Thanks.
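One quick way to confirm which of the files named above are actually absent is to list the extensions present in the generated index directory. The sketch below is only a diagnostic aid under stated assumptions: the function name `missing_index_extensions` is hypothetical, and the expected extensions are just the four from this post. If the writer uses Lucene's compound file format, these files would be packed inside a .cfs file and would not show up separately, so a "missing" result is not necessarily a problem by itself.

```python
import os

# Extensions named in the post: term vectors (.tvf, .tvx, .tvd) and norms (.nrm).
EXPECTED_EXTENSIONS = {".tvf", ".tvx", ".tvd", ".nrm"}


def missing_index_extensions(index_dir, expected=EXPECTED_EXTENSIONS):
    """Return the expected extensions that no file in index_dir carries."""
    present = {os.path.splitext(name)[1] for name in os.listdir(index_dir)}
    return sorted(set(expected) - present)
```

Calling `missing_index_extensions` on the directory produced by the reducer returns a sorted list of the extensions that were not found; an empty list means all four are present as separate files.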

P.S. This is my input. I assume schema.xml will pick out only the fields I need.

{"__isset_bit_vector":[1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,1],"acceptLanguage":"zh-cn","adcoreVersion":3,"adspaceHeight":0,"adspaceId":1017927,"adspaceOperationMode":"OWN","adspacePageType":"PAGE_HOME","adspacePosition":"POS_1","adspaceWidth":0,"advertiserId":189,"amCreativeType":"FLASH","bannerId":10210449,"bidMode":"CPM","bottomCreativeFlag":false,"breedIds":["17795143","335545243"],"browser":"IE8","campaignId":119403,"candIndex":0,"channelId":2843,"clickId":0,"clickeds":0,"cookie":3499577913538249008,"creativeType":0,"ctrInfo":{"__isset_bit_vector":[1,1],"ctr":0.0,"explorationScore":-1.0,"fee":100001000,"pCtr":0.0,"rpm":0},"dormer":false,"entityType":"AdViva","errorFlag":0,"eventTime":1366777563000,"eventType":115,"extfield":{"of":"1","showid":"Lqv6ld","type":"1","uid":"13667775605468123681402702329017"},"fee":100001000,"fraud":false,"frequencyInfo":[{"__isset_bit_vector":[1,1,1,1,1],"adGroupFreq":2,"advertiserFreq":2,"campaignFreq":2,"creativeFreq":2,"period":3600},{"__isset_bit_vector":[1,1,1,1,1],"adGroupFreq":5,"advertiserFreq":5,"campaignFreq":5,"creativeFreq":5,"period":86400}],"geoInfo":{"__isset_bit_vector":[1,1,1,1],"city":268,"country":1,"county":0,"province":263},"hotzoneHeight":0,"hotzoneWidth":0,"impressionId":8735081080326122404,"ip":[113,120,37,-118],"ipv6":false,"landingPageUrl":"","language":"ZH_CN","logId":"108122603723439","onePixel":false,"os":"WINDOWS_XP","pageNo":0,"pageReferralUrl":"","publisherId":130,"rawAdspaceId":0,"rawIP":[113,120,37,-118],"referrerUrl":"http://d2.sina.com.cn/rwei/mediav2013/240x330ls_an09_2013mediav.html","responseTime":1166,"servingDB":"mediav","solutionId":1073632,"solutionType":"AdViva","thirdPartyInfo":[{"__isset_bit_vector":[1],"vendorFields":{"SINAGLOBAL":"113.120.37.138_1366775259.946859","mvsign":"v%3Dv%28myJkZl7YAURSd%60U%3Axu"},"vendorId":0}],"userAgent":"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)","valid":true,"validClickeds":0}

Zhou Lizhi

May 7, 2013, 4:03:54 AM

to sensei...@googlegroups.com

I have converted the LZO files to JSON files. This is one line from a JSON file:
{"adspaceHeight":0,"campaignId":0,"publisherId":0,"impressionId":0,"solutionId":0,"channelId":0,"candIndex":0,"adspaceWidth":0,"eventType":109,"creativeType":0,"adspaceId":0,"cookie":1329495655428653360,"fee":0,"bannerId":0,"eventTime":1366775890000,"clickId":0,"responseTime":93,"advertiserId":0}

And this is my schema.xml:

<schema>
  <table uid="cookie">
    <column name="cookie" type="long" />
    <column name="eventTime" type="long" />
    <column name="eventType" type="short" />
    <column name="publisherId" type="int" />
    <column name="channelId" type="int" />
    <column name="adspaceId" type="int" />
    <column name="advertiserId" type="int" />
    <column name="campaignId" type="int" />
    <column name="solutionId" type="int" />
    <column name="bannerId" type="int" />
    <column name="clickId" type="long" />
    <column name="responseTime" type="int" />
    <column name="adspaceWidth" type="int" />
    <column name="adspaceHeight" type="int" />
    <column name="creativeType" type="short" />
    <column name="impressionId" type="long" />
    <column name="fee" type="int" />
    <column name="candIndex" type="short" />
  </table>
</schema>

Then I tried building the index through the Sensei file gateway, but I still cannot query anything, and I get this error repeatedly:
ERROR [AsyncDataConsumer] [] fatal: indexing thread loader manager has stopped
proj.zoie.api.ZoieException: fatal: indexing thread loader manager has stopped

I guess there is something wrong with my data or schema configuration, but I have no idea how to solve this. Can anyone help me? Thanks a lot.

John Wang

May 7, 2013, 3:18:20 PM

to sensei...@googlegroups.com

This error indicates there is bad data according to the schema. The indexing thread is forced to stop to avoid version corruption.

Try adding this to your sensei.properties:

sensei.index.skipBadRecords = true

and see if it works.

If it does, then your data has bad records.

-John
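To look for records that do not fit the schema before indexing, one option is to check each JSON line against the column types declared in schema.xml. The following is a minimal sketch, assuming only the integer types used in the schema earlier in this thread (short, int, long) and Java's signed ranges for them. The helper names `load_columns` and `check_record` are hypothetical, and the sketch does not reproduce how Sensei itself decides that a record is bad.

```python
import json
import xml.etree.ElementTree as ET

# Java signed ranges for the integer column types used in the schema.
INT_RANGES = {
    "short": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
    "long": (-2**63, 2**63 - 1),
}


def load_columns(schema_xml):
    """Map column name -> declared type from a schema.xml string."""
    root = ET.fromstring(schema_xml)
    return {c.get("name"): c.get("type") for c in root.iter("column")}


def check_record(line, columns):
    """Return a list of problems found in one JSON line (empty if none)."""
    record = json.loads(line)
    problems = []
    for name, col_type in columns.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        bounds = INT_RANGES.get(col_type)
        if bounds is None:
            continue  # column type not covered by this sketch
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{name}: expected {col_type}, got {value!r}")
        elif not bounds[0] <= value <= bounds[1]:
            problems.append(f"{name}: {value} out of range for {col_type}")
    return problems
```

Running `check_record` over every line of the input file, with the columns returned by `load_columns`, gives a per-record list of missing fields, non-integer values, and values that overflow the declared type.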

Zhou Lizhi

May 8, 2013, 10:37:06 PM

to sensei...@googlegroups.com

No, it doesn't work. I still get query results like this:

{
    "_docid": 4,
    "_grouphitscount": 0,
    "_score": 0,
    "_srcdata": "",
    "_uid": 111124745615155
}

The uid is correct, but the other columns are missing. There are no error messages in sensei-main.log.

Zhou Lizhi

May 9, 2013, 3:18:51 AM

to sensei...@googlegroups.com

I think I have fixed this. I had left out the facets section of the schema. It's my fault: I thought facets were optional and that columns could be queried by default.

Yonghui Zhao

May 9, 2013, 9:49:33 AM

to sensei...@googlegroups.com

If you just want to retrieve a column's value, you don't need to define it as a facet.

You can mark the column as stored in the schema and enable fetchStore in your query.

John Wang

May 10, 2013, 12:17:44 AM

to sensei...@googlegroups.com

Keep in mind that using a stored field means the data is effectively retrieved from disk, which is much slower.
