processing 50M tweetID-date pairs per worker


Ahmet Uyar

Aug 10, 2020, 2:08:50 PM
to Twister2
Hi guys,

Twister2 is now able to run the membership-finding problem with 240 workers, each processing 50M tweetID-date pairs. It used local HDDs for intermediate data storage.

It took 8152 seconds (136 minutes) in total. 

Ahmet

Gurhan Gunduz

Aug 10, 2020, 2:20:38 PM
to Ahmet Uyar, Twister2
Great

--
You received this message because you are subscribed to the Google Groups "Twister2" group.
To unsubscribe from this group and stop receiving emails from it, send an email to twister2+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/twister2/CAPBRfYd%2BZ6goAvtRd%2Be%3D6KoexUreEzRwao8%2BMvSgxkfKMcMkSw%40mail.gmail.com.

Supun Kamburugamuve

Aug 10, 2020, 3:09:10 PM
to Ahmet Uyar, Twister2
Great. Altogether this is 240 x 50 million = 12 billion pairs being partitioned and then checked for membership.
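For reference, the arithmetic can be checked with a short Python sketch. It uses only the figures quoted in this thread (240 workers, 50M pairs each, 8152 seconds for the local-disk run); the variable names are made up for illustration. The rates are end-to-end wall-clock averages for the whole job, so they should not be read as a disk or I/O benchmark.

```python
# Figures quoted in this thread; names are illustrative only.
WORKERS = 240
PAIRS_PER_WORKER = 50_000_000
LOCAL_DISK_SECONDS = 8152  # run with intermediate data on local HDDs

total_pairs = WORKERS * PAIRS_PER_WORKER

# End-to-end averages over the whole run (not a disk benchmark).
aggregate_rate = total_pairs / LOCAL_DISK_SECONDS         # pairs/s, all workers
per_worker_rate = PAIRS_PER_WORKER / LOCAL_DISK_SECONDS   # pairs/s, one worker

print(f"total pairs: {total_pairs:,}")
print(f"aggregate:   {aggregate_rate:,.0f} pairs/s")
print(f"per worker:  {per_worker_rate:,.0f} pairs/s")
```

That works out to roughly 1.47 million pairs per second across the job, or about 6,100 pairs per second per worker.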

Best,
Supun..


--
Supun Kamburugamuve, PhD
Digital Science Center, Indiana University
Member, Apache Software Foundation; http://www.apache.org
E-mail: supun@apache.org;  Mobile: +1 812 219 2563


Gregor von Laszewski

Aug 10, 2020, 3:16:16 PM
to Supun Kamburugamuve, Ahmet Uyar, Twister2
Is there a benchmark that shows the data access speed per process/processor?

Ahmet Uyar

Aug 11, 2020, 7:38:47 AM
to Gregor von Laszewski, Supun Kamburugamuve, Twister2
Hi Gregor,

This was not a performance-testing experiment. We just wanted to test whether Twister2 can cope with large volumes of data. The program was designed to test Twister2 features, so it contains multiple inefficient computations and communications.

thanks,

Ahmet

Ahmet Uyar

Aug 13, 2020, 3:37:43 PM
to Twister2
Hi guys,

I have run the membership-finding problem with 240 workers, each processing 50M tweetID-date pairs, this time saving intermediate data to HDFS. It took 3577 seconds. HDFS seems to be much faster than using the local disk for intermediate data, even though HDFS is also running on the same set of local disks.

The same problem took 8152 seconds when saving intermediate data to the local disk of each machine directly. 
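To put the two timings side by side, here is a minimal Python sketch that uses only the two run times reported above; the variable names are made up for illustration.

```python
# Run times reported in this thread, in seconds; names are illustrative only.
LOCAL_DISK_SECONDS = 8152  # intermediate data on each machine's local disk
HDFS_SECONDS = 3577        # intermediate data on HDFS

speedup = LOCAL_DISK_SECONDS / HDFS_SECONDS
seconds_saved = LOCAL_DISK_SECONDS - HDFS_SECONDS
fraction_saved = seconds_saved / LOCAL_DISK_SECONDS

print(f"local disk: {LOCAL_DISK_SECONDS / 60:.1f} min")
print(f"HDFS:       {HDFS_SECONDS / 60:.1f} min")
print(f"speedup:    {speedup:.2f}x")
print(f"time saved: {seconds_saved} s ({fraction_saved:.0%})")
```

So the HDFS run was about 2.3 times faster, finishing in just under an hour instead of about 136 minutes.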

Ahmet

Supun Kamburugamuve

Aug 13, 2020, 6:18:54 PM
to Ahmet Uyar, Twister2
It must be because all the workers are trying to use the same disk at once. When the writes go through HDFS, HDFS controls the disk writing.

Best,
Supun..


--
Supun Kamburugamuve, PhD
Principal Software Engineer, Twister2.org
Member, Apache Software Foundation, Apache.org
Indiana University, Bloomington, USA