How to get original Facebook tsv file?

du Su

Nov 17, 2014, 5:03:31 PM
to swimapredu...@googlegroups.com
In SWIM, you use the command below to generate a FacebookTrace.tsv file, and then sample this file to get the final synthesized workloads.

perl parse-hadoop-jobhistory.pl FB-2009_LogRepository > FacebookTrace.tsv 

But how can I get FacebookTrace.tsv so that I could design my own workload from the source data?
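(For context, here is a minimal sketch of what consuming such a tsv trace might look like. The column layout shown is an assumption for illustration only; the real layout of FacebookTrace.tsv is whatever parse-hadoop-jobhistory.pl emits.)

```python
import csv

def load_trace(path):
    """Parse a SWIM-style tab-separated trace into a list of job dicts.

    Column layout is hypothetical: job_id, submit time in ms, then
    map-input / shuffle / reduce-output byte counts per job.
    """
    jobs = []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            jobs.append({
                "job_id": row[0],
                "submit_ms": int(row[1]),
                "map_input_bytes": int(row[2]),
                "shuffle_bytes": int(row[3]),
                "reduce_output_bytes": int(row[4]),
            })
    return jobs
```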

Thanks!

Yanpei Chen

Nov 19, 2014, 7:15:17 PM
to swimapredu...@googlegroups.com
Thanks for your interest!

The raw FacebookTrace.tsv contains some proprietary Facebook info and is not public.

What would you like to do? I'm asking to see whether we can adapt the public info to fit your purpose.

Cheers,
Yanpei.

--
You received this message because you are subscribed to the Google Groups "SWIMapReduce-general" group.
To unsubscribe from this group and stop receiving emails from it, send an email to swimapreduce-gen...@googlegroups.com.
To post to this group, send email to swimapredu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/swimapreduce-general/5975d473-8a4d-4b0c-a949-91204610ad3f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

du Su

Nov 19, 2014, 8:57:04 PM
to swimapredu...@googlegroups.com
Hi Yanpei,

Actually, I just want to use FacebookTrace.tsv to generate workloads that are representative of my cluster. We want to test load balancing and the stabilization of the cluster's power performance, so jobs should be submitted at an appropriate rate: neither too fast, where all blocked jobs behave like offline jobs, nor too slow, where machines may have too many idle slots, which affects stabilization.

If the trace is confidential, then maybe I could sample from a "sampled workload" that was itself sampled from FacebookTrace.tsv. Meanwhile, I read in the paper that a short sampling length gives high representativeness of data characteristics but may not bound the overall distribution. So here are two solutions I came up with:
1. A .tsv file sampled from FacebookTrace.tsv, in the same format as FacebookTrace.tsv. I think the sample length could be 4 hrs, so that I could still guarantee a bounded distribution if we re-sample from this new trace for rarefaction. And the overall length should be longer, say 16 hrs.
2. We manually sample from a given synthetic workload such as FB-2009_samples_24_times_1hr_0.tsv. But similarly, the sampling length and overall length should be longer.
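(Option 1 above could be sketched roughly like this: pick a random contiguous window of jobs from a longer trace, then concatenate independent windows until the desired overall length is reached. Times are assumed to be in milliseconds, and the function names are hypothetical, not part of SWIM.)

```python
import random

def sample_window(jobs, window_ms, rng=random):
    """Pick one contiguous time window of jobs and rebase times to zero.

    `jobs` is a list of (submit_ms, payload) tuples sorted by submit_ms.
    """
    span = jobs[-1][0] - jobs[0][0]
    if span <= window_ms:
        start = jobs[0][0]
    else:
        start = jobs[0][0] + rng.randrange(span - window_ms)
    return [(t - start, p) for (t, p) in jobs if start <= t < start + window_ms]

def build_trace(jobs, window_ms, total_ms, rng=random):
    """Concatenate independently sampled windows until total_ms is covered."""
    out, offset = [], 0
    while offset < total_ms:
        for t, p in sample_window(jobs, window_ms, rng):
            out.append((offset + t, p))
        offset += window_ms
    return out
```

With window_ms of 4 hours and total_ms of 16 hours, this would produce the 4-hr-sample / 16-hr-overall trace described in option 1.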

If the above solutions are still difficult, I could use the given synthetic workload, but it might be less accurate.

Also, can I apply to Facebook directly for their trace?

Thanks for your patient answer! Looking forward to your reply.

Best,
Du


Yanpei Chen

Nov 21, 2014, 2:41:05 PM
to swimapredu...@googlegroups.com
Oh I see. Sounds like you have a good understanding of the sampling mechanics in SWIM.

Here's how I would "test load balancing and stabilization of the cluster's power performance". I would actually run a workload already in the repository, say FB-2009_samples_24_times_1hr_0.tsv. Then I would resample that workload with different sampling lengths and overall lengths. This will tell you how your system behaves for the same workload, altered to have different amounts of burstiness, and you can see what is "too fast" or "too slow" for you.

Then you can repeat the same for FB-2009_samples_24_times_1hr_1.tsv and FB-2010_samples_24_times_1hr_0.tsv, so you can see how your system behaves with different workloads.
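(One simple way to see whether a resampled trace has become "too fast" or "too slow" is to bin job arrivals into fixed-width intervals and compare the counts across variants; a minimal sketch, with the bin width as an arbitrary choice rather than anything SWIM prescribes:)

```python
def arrival_counts(submit_ms, bin_ms):
    """Count job arrivals per fixed-width time bin.

    Comparing these counts across resampled traces gives a crude view
    of burstiness: high variance between bins means a burstier trace.
    """
    n_bins = max(submit_ms) // bin_ms + 1
    counts = [0] * n_bins
    for t in submit_ms:
        counts[t // bin_ms] += 1
    return counts
```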

Facebook's engineering processes have changed a lot since 2009 and 2010. If you're willing to ask Facebook, why not get their latest traces and we can add it to the SWIM repository? Let me know if you want to invest some time doing this, I can help.


du Su

Nov 21, 2014, 10:58:57 PM
to swimapredu...@googlegroups.com
Hi Yanpei,

By "resample that workload", do you mean sampling FB-2009_samples_24_times_1hr_0.tsv, or sampling the jobs that have already run on my machine? I think it is convenient and effective; I just worry that the representativeness of the workload would be weakened by resampling.

It is a good idea, and I am considering following your suggestions. I am also interested in asking Facebook for their latest trace, and I would definitely be willing to add it to SWIM. Can you tell me what I should do?

Best
Du 

Yanpei Chen

Nov 22, 2014, 5:56:31 PM
to swimapredu...@googlegroups.com
Hi Du,

Let's move the discussion off thread. Please send me a direct email so I can put you in touch with my contact at Facebook.

Thanks,
Yanpei.
