Issue with fastingest on the second run

Sateesh Babu Penumalla

May 14, 2013, 6:18:06 AM

to echo...@googlegroups.com

Hi,

I am trying to ingest the community data and then query for tracks that it contains.

I run the following command to fast-ingest the data:

python2 fastingest.py ../../data/echoprint-dump-1-1-1.json

The ingest completes without errors.

When I then query for a track, I get the response below. The track is found correctly:

{"total_time": 1274, "score": 125, "ok": true, "query": "eJztWGtuHjcMvJJeFKXj6MX7H6Gj2dZKP9gqnCZAnNY_BsZyORxKJKX9nPMzuQtYvEAI4QbibqB2g7pu0OcNVr9AzHqDmm-w4gUS3rhAmzeY4wIi5QYl32Bv44eQXbjB3qgP4V45wYcb3GujrBuMcYHo2g1SvcFesI9hpAskF2-Q_A12yX8Mu6g_BusXkFhv0OUGu8U_hP93_1fY_X_YX9Ub_Ib7W25w3d8v2b8_cX_jDb5g__6r_S3uBl9yf3-3_v015_NP3P3r2f1f2_0vub-_aP_OC_yi8_nLdegIPWVNqUuQHKb0ISNZkrayb7XG3LOOEqKt0HPtsVTJudbm8DlIa6pavfZmawaba1kRCTmP6hPen9Xhbc0jP9bhfJ02i3mk7DX7JavbrKI-GyT1lmPTQVXBeVliedkwZz2NtGoe-KSnjDVTxIPV9wJYdtWqE6fg28JnrWO5NlwrXnRK9j6FOixR7iwZRbRfwUd8w01i4E9zboxm2qdgUc1q6n1JGSWpbw4JaWt0owwye1A8zFw_iuQrE6smq3r8rzXrpzTTjVb6UvNDwPfe1PPlZ9-2cOqjeotVlSnQN9WCwLRaqdisOpM2bBX3Q0rtslpKpYeVxgGsWwovzw7MvvLUVBT7u3EF7HvMQ3orlksNUxdeKq7pQp3EGGmd2Y_6o33RIi3SSt-wWuu0Hl8yTydD6GUlo7wZV2Lvj_WNmXGPKjLnEUejqveYP9T8eeYAkV3TWg1NgCNnMx8-H_uItFpcYTwLsflI1dXropsfvgfyHRmHFJ2Ntvu7Ko851A9Q1cuzA-pmHR2V1ZL50AeiGRpCBnbCapVoPrWCyKG7GLAA4bGWvn_e-rG-FQVdaKVvS3lXBKzf-G7muJIF-sZQwqRv0-IdrYeZcb9RtZnDjNKp6jB_GFelWkYlJIktofV23BMN47K1kcZYcZZmT8htpYy5MHYs15TylNaphVQhrKATB1SnFRWSC0NSy5PCjvuE3MzM8iQ4Z8GevwEzenl2QERjdxmHDyaftIHxv7RKijgyVm8T0yl6VUzVOdBuzhutmFwlfqcvrcf3sTpMKOux4dQPlpOz4I-H79knWhE4LpLSSipa6aZZlpLviKQbmmI0WqmqFt-Vml9UpSSo-K0KfVEXM_JlTfee5pd8qfnkezQz-MnIi7f5khEFvZcRrdRyZFgr2Q6Q-eXZN9awb1KxBLTPPhRGb346lRxdwJmOJZG0pgS3mrrVt4boMrJoAX2y0oyGUaTeoeSxMoW-ghN94LwTa7gBBBxDAyWsuBD4vQabGSe_q7TSFx0xcQqPnGONPpH58K2qUWnVgmsD3cjHQDixkYC1aYI2NOYR_Zxoi-X8IVUQDlc67htYFB8bBj9jSIy4Ymyqz68GFnF37FyFmslM36OZwZ98t-ZnhdQyzn0cziO18bjRgy87jXkf0U4T7s_y5xqAKgzMGAwJFw5w_V6eHaj7ptvTwpV292MtCzgU1ygkhduImfNjoXAxqFZOsUxaAwqsfafvY33zpfX4kjkLJip9yfzq-4OZTTEfn4w28yrRzffiZh_iIDN8eiSzSS9yj_v4vsV9byU_la8mH4PD6b9nsHt24USDHFsH6Pby7EBDc-JOg3Iyty_ZWYNiNzEs3Gy4LVTc4WiVVOXFiq--FUNqkmgVnBuLVqwNBtZnmAVXBdPsAkShUcj8Dd_EOUYrhudwdCMfqej7uFXrlXxHBt0wS0ejlXEHDkf_niofayvMyHd8B1AB5ld9N99ay_B1xLCKokWZ78kDnzL7mgBrxCCeDEQrU6AVnzQoom0lPVOgNSfcFA6Q-eXZAZRNwqROzcbWNGt0VTZdD81jbKKSpAdVWn3G0EQRuZnEDU-rt5VaQF6O1uMLC77vti8-FEIMKDGXs81G3xOXvlYGirTh1oRvD130pQf5MFWHxpErziHcReh
GpbZwFtCXbrRS6ZFBt4ilA0Mb-fOaO65V6WiuEV1OX8EcflaDMuhL5pd8zZaGk2_HZ2Ci5uAbPtDe8j1aTr5_af4zX8oYUkM5QC0vzw78AYM1q0E=", "message": "OK (match type 6)", "qtime": 64, "match": true, "track_id": "TRAVIZZ123E858ECF7"}

I then run fastingest on the second file:

python2 fastingest.py ../../data/echoprint-dump-1-1-2.json

This also ingests without errors, and the track count shown in the admin GUI increases accordingly.

However, after ingesting the second JSON file, the same query for the same track returns "no results found (type 7)":

{"total_time": 1277, "score": 0, "ok": true, "query": "eJztWGtuHjcMvJJeFKXj6MX7H6Gj2dZKP9gqnCZAnNY_BsZyORxKJKX9nPMzuQtYvEAI4QbibqB2g7pu0OcNVr9AzHqDmm-w4gUS3rhAmzeY4wIi5QYl32Bv44eQXbjB3qgP4V45wYcb3GujrBuMcYHo2g1SvcFesI9hpAskF2-Q_A12yX8Mu6g_BusXkFhv0OUGu8U_hP93_1fY_X_YX9Ub_Ib7W25w3d8v2b8_cX_jDb5g__6r_S3uBl9yf3-3_v015_NP3P3r2f1f2_0vub-_aP_OC_yi8_nLdegIPWVNqUuQHKb0ISNZkrayb7XG3LOOEqKt0HPtsVTJudbm8DlIa6pavfZmawaba1kRCTmP6hPen9Xhbc0jP9bhfJ02i3mk7DX7JavbrKI-GyT1lmPTQVXBeVliedkwZz2NtGoe-KSnjDVTxIPV9wJYdtWqE6fg28JnrWO5NlwrXnRK9j6FOixR7iwZRbRfwUd8w01i4E9zboxm2qdgUc1q6n1JGSWpbw4JaWt0owwye1A8zFw_iuQrE6smq3r8rzXrpzTTjVb6UvNDwPfe1PPlZ9-2cOqjeotVlSnQN9WCwLRaqdisOpM2bBX3Q0rtslpKpYeVxgGsWwovzw7MvvLUVBT7u3EF7HvMQ3orlksNUxdeKq7pQp3EGGmd2Y_6o33RIi3SSt-wWuu0Hl8yTydD6GUlo7wZV2Lvj_WNmXGPKjLnEUejqveYP9T8eeYAkV3TWg1NgCNnMx8-H_uItFpcYTwLsflI1dXropsfvgfyHRmHFJ2Ntvu7Ko851A9Q1cuzA-pmHR2V1ZL50AeiGRpCBnbCapVoPrWCyKG7GLAA4bGWvn_e-rG-FQVdaKVvS3lXBKzf-G7muJIF-sZQwqRv0-IdrYeZcb9RtZnDjNKp6jB_GFelWkYlJIktofV23BMN47K1kcZYcZZmT8htpYy5MHYs15TylNaphVQhrKATB1SnFRWSC0NSy5PCjvuE3MzM8iQ4Z8GevwEzenl2QERjdxmHDyaftIHxv7RKijgyVm8T0yl6VUzVOdBuzhutmFwlfqcvrcf3sTpMKOux4dQPlpOz4I-H79knWhE4LpLSSipa6aZZlpLviKQbmmI0WqmqFt-Vml9UpSSo-K0KfVEXM_JlTfee5pd8qfnkezQz-MnIi7f5khEFvZcRrdRyZFgr2Q6Q-eXZN9awb1KxBLTPPhRGb346lRxdwJmOJZG0pgS3mrrVt4boMrJoAX2y0oyGUaTeoeSxMoW-ghN94LwTa7gBBBxDAyWsuBD4vQabGSe_q7TSFx0xcQqPnGONPpH58K2qUWnVgmsD3cjHQDixkYC1aYI2NOYR_Zxoi-X8IVUQDlc67htYFB8bBj9jSIy4Ymyqz68GFnF37FyFmslM36OZwZ98t-ZnhdQyzn0cziO18bjRgy87jXkf0U4T7s_y5xqAKgzMGAwJFw5w_V6eHaj7ptvTwpV292MtCzgU1ygkhduImfNjoXAxqFZOsUxaAwqsfafvY33zpfX4kjkLJip9yfzq-4OZTTEfn4w28yrRzffiZh_iIDN8eiSzSS9yj_v4vsV9byU_la8mH4PD6b9nsHt24USDHFsH6Pby7EBDc-JOg3Iyty_ZWYNiNzEs3Gy4LVTc4WiVVOXFiq--FUNqkmgVnBuLVqwNBtZnmAVXBdPsAkShUcj8Dd_EOUYrhudwdCMfqej7uFXrlXxHBt0wS0ejlXEHDkf_niofayvMyHd8B1AB5ld9N99ay_B1xLCKokWZ78kDnzL7mgBrxCCeDEQrU6AVnzQoom0lPVOgNSfcFA6Q-eXZAZRNwqROzcbWNGt0VTZdD81jbKKSpAdVWn3G0EQRuZnEDU-rt5VaQF6O1uMLC77vti8-FEIMKDGXs81G3xOXvlYGirTh1oRvD130pQf5MFWHxpErziHcRehGp
bZwFtCXbrRS6ZFBt4ilA0Mb-fOaO65V6WiuEV1OX8EcflaDMuhL5pd8zZaGk2_HZ2Ci5uAbPtDe8j1aTr5_af4zX8oYUkM5QC0vzw78AYM1q0E=", "message": "no results found (type 7)", "qtime": 51, "match": false, "track_id": null}

Please let me know if I am missing anything.

Thanks,

Sateesh

Andrew Nesbit

May 14, 2013, 10:57:46 AM

to echo...@googlegroups.com

Hi Sateesh,

The issue is that the matching logic in fp.py does not really allow for multiple FPs (fingerprints) per song. What happens is that multiple matches corresponding to the same song are returned from the Solr and TTyrant back ends to fp.py. Even though these duplicate matches score highly, their scores are so similar that the matching logic cannot decide whether the high-scoring matches are a set of false positives or the same song.
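To make the ambiguity concrete, here is a minimal sketch of a "top score must clearly beat the runner-up" rule. It is not the actual code in fp.py; the function name pick_match and the thresholds min_score and min_margin are hypothetical and chosen only for illustration.

```python
def pick_match(candidates, min_score=50, min_margin=0.1):
    """Return the winning track_id, or None if the result is ambiguous.

    candidates is a list of (track_id, score) pairs. The thresholds are
    illustrative assumptions, not the values used by the real matcher.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return None  # nothing scored high enough
    if len(ranked) == 1:
        return ranked[0][0]
    top, runner_up = ranked[0][1], ranked[1][1]
    # Two near-identical high scores look the same whether they are
    # duplicates of one song or two false positives, so give up.
    if (top - runner_up) < min_margin * top:
        return None
    return ranked[0][0]
```

With a rule of this shape, a single candidate scoring 125 would be accepted, while two candidates scoring 125 and 120 would be rejected even if both are fingerprints of the same song.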

I recognise that this is a shortcoming that needs solving, i.e., by modifying the matching logic and ingestion paths to allow for multiple FPs per song. We're now working on a solution.

In the short term, the workaround is to avoid ingesting the same track twice.

Andrew

Sateesh Babu Penumalla

May 15, 2013, 12:59:21 AM

to echo...@googlegroups.com

Hi Andrew,

Thank you for your quick reply.

There seems to be a small misunderstanding (my subject line was misleading). I am not ingesting the same track again. I am splitting each community data file into 25 parts to avoid the memory issue. After ingesting the first file, the query returns the correct result, but after ingesting the 2nd file, the same query no longer matches.
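For reference, the splitting step could look roughly like the sketch below. It assumes the dump is a single top-level JSON list of track objects; the function name split_dump and the output naming scheme are hypothetical. Note that it still loads the whole dump once in order to split it.

```python
import json

def split_dump(path, parts=25):
    """Split a JSON list stored at `path` into up to `parts` smaller files.

    Assumes the file holds one top-level JSON list. Returns the list of
    paths written, in order.
    """
    with open(path) as f:
        tracks = json.load(f)
    size = -(-len(tracks) // parts)  # ceiling division
    out_paths = []
    for i in range(parts):
        chunk = tracks[i * size:(i + 1) * size]
        if not chunk:
            break  # fewer tracks than parts, or list exhausted
        out = "%s.part%02d.json" % (path, i + 1)
        with open(out, "w") as f:
            json.dump(chunk, f)
        out_paths.append(out)
    return out_paths
```

Each track ends up in exactly one output file, so the parts can be passed to fastingest.py one at a time.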

Every time I start the process, I wipe the data in both Solr and Tyrant, as described in one of the earlier posts, using the commands below. I hope this ensures that I am not ingesting the same track again.

Commands used to wipe the codes:

python2 wipe_codes.py

tcrmgr vanish localhost
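Wiping only guarantees that both stores start empty. To rule out the same track appearing in two of the split files, a check along these lines might help. This is a sketch only: shared_ids and the get_id accessor are hypothetical, and where the track ID lives inside each dump entry is an assumption that should be checked against the actual files.

```python
def shared_ids(tracks_a, tracks_b, get_id):
    """Return the sorted IDs that occur in both lists of track entries.

    get_id is a caller-supplied function that extracts the track ID from
    one entry; the entry layout is not assumed here.
    """
    ids_a = set(get_id(t) for t in tracks_a)
    ids_b = set(get_id(t) for t in tracks_b)
    return sorted(ids_a & ids_b)
```

An empty result for every pair of parts would confirm that no track is ingested twice.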

Thanks,

Sateesh
