CSV File Size Limit?

Chris Vargo

Feb 25, 2013, 12:08:19 PM
to overvie...@googlegroups.com
Hi Everyone --

A UNC J-School Ph.D. here. I've been using the initial release of Overview for a while now, and I'd like to make the jump to the new site. However, I use large collections of Twitter data. Is there a file size limit on the .csv upload feature? If so, what is it?

Thanks for all you do. This is truly a great tool for document clustering.

Best,
Chris

Jonas Karlsson

Feb 25, 2013, 12:27:00 PM
to overvie...@googlegroups.com
Hi Chris,

Currently, we only process the first 20,000 rows of csv files (not counting the header row).
Over time, this number will increase, as we improve our algorithms.

How large are your files, and how many tweets do they usually contain?

_jonas

Adam Hooper

Feb 25, 2013, 12:31:10 PM
to overvie...@googlegroups.com
Right now, we're limited to 20,000 documents. In your case, that means
20,000 tweets. The 20,001st and subsequent documents in the .csv (or
on DocumentCloud) will be ignored.
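
To see ahead of time which rows would make the cut, here is a minimal Python sketch that copies the header plus the first 20,000 data rows into a new file. It is not Overview code: the function name `truncate_csv` and the `ROW_LIMIT` constant are made up for illustration, and it assumes the .csv has a single header row.

```python
import csv
import io

# Limit quoted in this thread; treat it as an assumption, since it may change.
ROW_LIMIT = 20_000


def truncate_csv(src, dst, limit=ROW_LIMIT):
    """Copy the header plus at most `limit` data rows from src to dst.

    `src` and `dst` are open text-mode file objects.
    Returns the number of data rows written (the header is not counted).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)

    header = next(reader, None)
    if header is None:
        return 0  # empty input: nothing to copy
    writer.writerow(header)

    written = 0
    for row in reader:
        if written >= limit:
            break  # everything past the limit would be ignored anyway
        writer.writerow(row)
        written += 1
    return written
```

Going through the csv module rather than counting raw lines matters for tweets, because a quoted field can contain line breaks. When using real files, open both with newline="" as the csv module documentation recommends.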

We haven't decided upon other limits (actual .csv-file size, for
instance, or maximum size per document). We'll document these by and
by. In the meantime, I assume we'll handle at least 1 GB per file upload and 500 KB per
document just fine.

Enjoy life,
Adam

Chris Vargo

Feb 25, 2013, 1:43:58 PM
to overvie...@googlegroups.com
Thanks! I usually look at around a million tweets at a time, but that depends on what question I'm trying to answer. I'll play around with some trimmed samples of 20k for now.
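
One way to draw such a trimmed sample without loading a million rows into memory is reservoir sampling. The sketch below is only an illustration under assumptions: the function name `sample_csv_rows` is hypothetical, the file is assumed to have one header row, and a uniform random sample is assumed to be what's wanted (as opposed to, say, the first 20k rows).

```python
import csv
import io
import random


def sample_csv_rows(src, k, seed=None):
    """Return (header, rows), where rows is a uniform random sample of
    at most k data rows from the open text-mode CSV file `src`.

    Uses reservoir sampling (Algorithm R): one pass over the file,
    holding only k rows in memory at any time.
    """
    rng = random.Random(seed)
    reader = csv.reader(src)
    header = next(reader, None)

    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return header, reservoir
```

Calling `sample_csv_rows(f, 20_000)` on an open file gives a sample that fits under the current limit; passing a `seed` makes the draw repeatable. The sampled rows do not keep their original order.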

Are you guys still allowing downloads of the Java version of Overview?

Best,
Chris

Jonathan Stray

Feb 25, 2013, 8:29:25 PM
to overvie...@googlegroups.com
You can still download the prototype (Java) version, but we are no longer supporting it. In any case, it won't handle more than about 20-30k documents either.

What questions are you asking of a million tweets? We would like to get to that size, but there are significant challenges all the way down the stack (from upload to UI). Knowing what you are doing will help us prioritize features.

  - Jonathan

Chris Vargo

Feb 25, 2013, 10:32:27 PM
to overvie...@googlegroups.com
Well, I work with Dr. McCombs and Dr. Shaw on agenda-setting theory. So generally, I'm looking at the clustering of issues, how different issues are related, and the top attributes that are associated with salient issues. The 20k limit is just fine for the news media we capture. But to start looking at the public and how they're talking about issues, a larger limit would be needed.

I just tried the new tool with some news media tweets, and the results are promising. The clustering really picked up on the big issues for that given time period. 

I look forward to what lies ahead for Overview. Thanks again for answering my question!

Jonathan Stray

Feb 26, 2013, 3:23:07 PM
to overvie...@googlegroups.com
Indeed. I'm familiar with McCombs and Shaw's classic work. I assume you've also seen MediaCloud at MIT, which also attempts media monitoring?

So what would Overview do for you in this setting, ideally?
 
 - Jonathan