Large data set is no longer searchable

Eric Sagara

Sep 30, 2012, 2:48:09 PM
to panda-pro...@googlegroups.com
Hello all,

I am still working with this large data set; it contains about 5 million records split among 21 different files. While I have yet to import the final file (due to another outstanding issue also posted in this group), it appears the data set is no longer returning valid results: searching for data we know is in the set returns no values. I tried to re-index, but the re-indexing failed with the error message below:

The import failed with the following exception:

pop from empty list

Traceback:
  File "/usr/local/lib/python2.7/dist-packages/celery/execute/trace.py", line 181, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/panda/panda/tasks/reindex.py", line 65, in run
    data = read_buffer.pop(0)


Now, my Python is a bit rusty and I have not been able to look at the actual reindex.py file (I don't have access to the back end, as far as I know), but I worry that the data set has too many records, too many individual files, or a combination of both. Is there a solution, or at least a workaround? I am tempted to try re-importing with files that are clustered together, but when the files are too large I tend to get 502 and 504 errors.

Thanks all,

Eric Sagara

Eric Sagara

Sep 30, 2012, 5:40:34 PM
to panda-pro...@googlegroups.com
OK, so I looked through reindex.py in the GitHub repository, and I am wondering whether increasing the read_buffer_size would help.
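
For context, here is a minimal sketch of the kind of buffered-read loop that can raise "pop from empty list". Only the read_buffer.pop(0) line comes from the traceback; the RowSource class, its read method, and BUFFER_SIZE are hypothetical stand-ins, not the actual reindex.py code:

    # Hypothetical reconstruction of the failing pattern; only
    # read_buffer.pop(0) is taken from the traceback above.
    BUFFER_SIZE = 1000

    class RowSource(object):
        """Stand-in for whatever feeds rows to the reindex task."""
        def __init__(self, rows):
            self._rows = list(rows)

        def read(self, n):
            chunk = self._rows[:n]
            self._rows = self._rows[n:]
            return chunk

    def iter_rows(source):
        read_buffer = []
        while True:
            if not read_buffer:
                read_buffer = source.read(BUFFER_SIZE)
            if not read_buffer:
                # Without this second check, read_buffer.pop(0) raises
                # IndexError: pop from empty list once the source is
                # exhausted or a refill comes back empty.
                return
            yield read_buffer.pop(0)

    for row in iter_rows(RowSource([{"id": 1}, {"id": 2}])):
        print(row)

If the real loop assumes every refill returns at least one row, a larger read_buffer_size would only delay the failure rather than prevent it.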

Christopher Groskopf

Dec 29, 2012, 10:08:43 PM
to panda-pro...@googlegroups.com
Hey Eric,

I'm looooong overdue in getting back to you about this error. I apologize for that. Things got away from me with the new job. Please let me know if/how you resolved this. If there is an outstanding issue I'd like to help you debug and ticket bugs as necessary.

Best,
Chris

Daniel Kirchberger

Feb 12, 2015, 1:28:35 PM
to panda-pro...@googlegroups.com
Hey guys,

I'm having the same issue with a similarly sized data set. I'm wondering whether changing that buffer size fixed the problem.

Thanks!

Dan 

Serdar Tumgoren

Feb 12, 2015, 8:15:44 PM
to panda-pro...@googlegroups.com

Hey Daniel,
A few questions to help debug:

* How big is the file in bytes?
* Does the error occur on import or reindex?
* In what context are you triggering the command that causes the error? (Web admin, bulk import management command, custom loader script that uses the API, etc.)
* What version of panda are you running?

Also, can you send a gist of the exact error output?

Serdar

Daniel Kirchberger

Feb 13, 2015, 1:40:11 PM
to panda-pro...@googlegroups.com
* How big is the file in bytes?
+ It's 11 separate CSVs for a total of 1,086,930,117 bytes. In total, there are 5,271,644 rows. 

* Does the error occur on import or reindex?
+ I had no problem on import, perhaps because they were separate files? 

* In what context are you triggering the command... 
+ I was adding column search for a single column from the actions menu. I originally got an error that the device was out of space (I only had 5 GB available). I increased the storage, tried adding the column search again, and immediately received this new error:

pop from empty list

Traceback:
  File "/usr/local/lib/python2.7/dist-packages/celery/execute/trace.py", line 181, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/panda/panda/tasks/reindex.py", line 66, in run
    data = read_buffer.pop(0)

* What version of panda are you running?
+ I am on 1.1.1

Thanks so much, Serdar!

Serdar Tumgoren

Feb 13, 2015, 2:23:15 PM
to panda-pro...@googlegroups.com
So it sounds like the current problem is related to the reindex operation. 

I haven't seen that particular error myself, though I have had trouble re-indexing "large" data sets, which inspired some tweaks to the manual (aka bulk) import management command. 

The modified version of the command allows you to configure field-level search on initial import. This sidesteps memory issues related to re-indexing, which is handled differently than initial upload on the backend.

Barring a better suggestion from the PANDA team or other list members, you might want to try pulling down the bulk management command updates from my fork. This involves a bit of a cumbersome git workflow -- you'll need to set my fork as a secondary upstream remote and fetch/merge the code from that upstream -- so let us know if you need help on that front.

Once you've got the code:
  • Merge all of the original 11 files into a single file (skip the header rows if all the files have them; you should end up with a single header row at the top of the file -- see the merge sketch below)
  • Move the merged file up to your PANDA server
  • Follow the steps in the bulk import docs
Let us know if you need help on any part of this process.
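
For the merge step, here is a minimal sketch in Python, assuming every part file shares the same header row (the file names and glob pattern are placeholders):

    import csv
    import glob

    # Hypothetical merge: stack all the part files into one CSV,
    # keeping only the first file's header row.
    paths = sorted(glob.glob("part_*.csv"))

    with open("merged.csv", "w") as out:
        writer = csv.writer(out)
        for i, path in enumerate(paths):
            with open(path) as f:
                reader = csv.reader(f)
                header = next(reader)        # every file starts with a header
                if i == 0:
                    writer.writerow(header)  # keep it once, at the top
                for row in reader:
                    writer.writerow(row)

csvstack from csvkit does the same thing if you'd rather not write the loop yourself.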

Best,
Serdar


Daniel Kirchberger

Feb 17, 2015, 4:47:37 PM
to panda-pro...@googlegroups.com
Amazing. I can't thank you enough for the thorough explanation. Everything worked as advertised. We already had a reporter use the data for a story today. 

Just an FYI (and this might've been something on my end):
When I uploaded the compiled CSV, it contained no quotes around the header fields or row values. I've never had a problem with this when using the web-based uploader, but the bulk upload was combining multiple headers and separating column values by spaces, regardless of the delimiter. I tried again, putting every value in quotes, and it worked fine. It could've been an encoding issue on my end (I used csvstack from csvkit for the first time) or some issue with the data set; I didn't come close to checking the 5.3 million rows for errors. I wasn't sure if there were flags I could've passed to specify the delimiter when I ran the manual_import command.
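
For reference, here's a minimal sketch of that quoting pass using Python's csv module (the file names are placeholders):

    import csv

    # Hypothetical cleanup pass: re-write the merged CSV with every
    # field quoted, which is what resolved the header-splitting
    # problem described above.
    with open("merged.csv") as src, open("merged_quoted.csv", "w") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in reader:
            writer.writerow(row)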

Thanks, again!

Dan

Serdar Tumgoren

Feb 17, 2015, 5:48:58 PM
to panda-pro...@googlegroups.com
On Tue, Feb 17, 2015 at 1:47 PM, Daniel Kirchberger <djk...@gmail.com> wrote:
Amazing. I can't thank you enough for the thorough explanation. Everything worked as advertised. We already had a reporter use the data for a story today. 

Wonderful! Very glad it worked for you. Regarding the header issues, perhaps the files had minimal quoting initially, and fully quoting the fields fixed the problem?

Also, a final disclaimer on the re-indexing fix: now that you've got the data in PANDA, you should probably avoid re-indexing it through the web admin going forward.

As mentioned, the web interface handles re-indexing differently than this bulk management command, so you'll likely hit the same memory-related problems you were seeing originally. If you ever need to update field-level search, the wisest course is to update the schema format file and re-run the bulk command. It's a kludgey workaround, and it makes me itch that the re-index functionality remains exposed in the web interface for a data set that is problematic to re-index through the standard web-based approach. This definitely feels like something that deserves a better end-user-facing solution (perhaps as simple as toggling a bit that hides re-index functionality on bulk-loaded docs).

Brian Boyer suggested a while back that PANDA folk might want to get together at NICAR in ATL to discuss the future of the PANDA project, so perhaps this issue could be folded into that larger discussion. Brian -- I don't want to steal your thunder, so hit us up with any details about time and place in ATL.

Best,
S.
 
 