CloudTW Meetup-0 Notes


李柏鋒 (Pofeng Lee)

Aug 29, 2010, 8:34:03 AM
to cloudtw
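Thank you all for your enthusiastic participation in CloudTW Meetup-0.

I tend to speak bluntly; my apologies to any senior members I may have offended.

Here are my notes. Feel free to edit them yourselves. Thanks.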

http://groups.google.com/group/cloudtw/web/meetup-0

CloudTW Meetup-0 Notes

* Possible topics to share
 * cclien: storage ?
 * JamesChen: after returning from the US, would like to share his apache.org experience
 * mjpan: ?

* Topics we hope to study in the future
 * SamWu/mjpan/pofeng: virtual machine management (maybe a small workshop with openfoundry)
 * Mr. 王堯坡: networking between VMs, layer-2 switch ?
 * Mr. 崔殷豪 (Ethan): real-time hadoop (?)
 * Dr. 田智青: new software solutions (?)

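* Discussion summary
 * Older versions of Eucalyptus top out at 30-50 machines; the bottleneck is the controller
 * The Eucalyptus component that corresponds to Amazon S3 is Walrus
 * Miss 葉 hopes that, besides studying the technology, we also consider commercial applications.
 * Rex: how can graph-structured data be handled with MapReduce ? Or does a new component have to be written ?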


Open items
1. meetup-1 time ? Tuesday, Wednesday, or Sunday ?
2. meetup-0 venue ? Mix ? 果子 ? Trend office ?
3. meetup-0 topic ?


--
Pofeng "informer" Lee, 李柏鋒, pofeng at gmail dot com

James Chen

Aug 29, 2010, 9:03:34 AM
to clo...@googlegroups.com
Thanks Michael, Pofeng,

Sorry that my daughter kept making a fuss today.

I plan to attend ApacheCon NA 2010 on Nov 1-5. http://na.apachecon.com/c/acna2010/
The conference will mainly cover the open source projects at apache.org:

- Cassandra/NoSQL
- Content Technologies
- (Java) Enterprise Development
- Felix/OSGi
- Geronimo
- Hadoop + friends/Cloud Computing
- Lucene, Mahout + friends/Search
- Tomcat
- Tuscany

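After I get back I can put together some takeaways to share with everyone.
Personally, I am very interested in applications of Hadoop and in upcoming Big Data technologies.
On the NoSQL side my focus is HBase. Our company's architect, Andrew Purtell, is an HBase committer,
so if anyone has Hadoop/HBase questions, we would be glad to exchange ideas.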

Jazz Yao-Tsung Wang

Aug 29, 2010, 3:17:04 PM
to CloudTW
Sorry I couldn't attend the meetup. Here is some information I know of:

>  * cclien: storage ?

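Current solutions for storage virtualization include Ceph, Lustre, and GPFS (IBM has released a GPL version).
That said, I would suggest also considering replication support and deduplication support.
File systems that currently support deduplication include (1) ZFS, (2) lessfs, and (3) SDFS.

>  * Mr. 王堯坡: networking between VMs, layer-2 switch ?

Take a look at Open vSwitch, which was just released this year - http://openvswitch.org/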

>  * Mr. 崔殷豪 (Ethan) real-time hadoop(?)

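Because Hadoop itself has a shared-nothing architecture, passing parameters between tasks is inconvenient.
I have long thought that, given the chance, it could be combined with Apache ActiveMQ to improve some of Hadoop's real-time issues.

>  * Dr. 田智青: new software solutions (?)
> * Miss 葉 hopes that, besides studying the technology, we also consider commercial applications

Besides virtualization, most of the current talk is about things like BI or large-scale data analysis.

>  * Rex: how can graph-structured data be handled with MapReduce ? Or does a new component have to be written ?

Is Rex asking about graph algorithms (distributed graph algorithms) ??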

- Jazz

Jazz Yao-Tsung Wang

Aug 29, 2010, 3:20:31 PM
to CloudTW
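> >  * Rex: how can graph-structured data be handled with MapReduce ? Or does a new component have to be written ?
> Is Rex asking about graph algorithms (distributed graph algorithms) ??

The X-RIME project uses Hadoop for social network analysis and should involve graph algorithms,
so it may be worth a look.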
http://xrime.sourceforge.net
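
For anyone wondering what "solving a graph problem with MapReduce" can look like, here is a minimal sketch. It is not taken from X-RIME or Hadoop. It runs breadth-first search as repeated map / shuffle / reduce rounds in plain Python. The names `bfs_map`, `bfs_reduce`, `run_round`, and `bfs`, and the in-memory shuffle, are assumptions made for illustration; on a real cluster each round would typically be a separate job reading the previous round's output.

```python
from collections import defaultdict

INF = float("inf")


def bfs_map(node, record):
    """Map: re-emit the node's own data and propose dist + 1 to each neighbour."""
    dist, neighbours = record
    yield node, ("graph", neighbours)   # carry the adjacency list forward
    yield node, ("dist", dist)          # keep the distance known so far
    if dist != INF:
        for neighbour in neighbours:
            yield neighbour, ("dist", dist + 1)


def bfs_reduce(node, values):
    """Reduce: keep the adjacency list and the smallest proposed distance."""
    neighbours, best = [], INF
    for kind, payload in values:
        if kind == "graph":
            neighbours = payload
        else:
            best = min(best, payload)
    return node, (best, neighbours)


def run_round(graph):
    """One round: map every node, group values by key, reduce each group."""
    shuffled = defaultdict(list)
    for node, record in graph.items():
        for key, value in bfs_map(node, record):
            shuffled[key].append(value)
    return dict(bfs_reduce(key, values) for key, values in shuffled.items())


def bfs(adjacency, source):
    """Iterate rounds until no distance changes (hypothetical driver loop)."""
    graph = {n: (0 if n == source else INF, list(nbrs))
             for n, nbrs in adjacency.items()}
    while True:
        next_graph = run_round(graph)
        if next_graph == graph:
            return {n: dist for n, (dist, _) in next_graph.items()}
        graph = next_graph
```

Calling `bfs({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}, "a")` gives distances 0, 1, 1, 2 for a, b, c, d and leaves the unreachable node e at infinity. The point of the sketch is the cost model: every round re-emits the whole graph, which is one reason people ask whether a dedicated component would serve graph workloads better.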

- Jazz

fr3@K

Aug 29, 2010, 9:40:29 PM
to clo...@googlegroups.com
2010/8/30 Jazz Yao-Tsung Wang <jazzw...@gmail.com>:

Don't use AMQ; it is way over-promoted and overrated. My team had used it
in production for quite a while, and got bitten very badly.

We've done a comprehensive evaluation of AMQ, Open MQ and a couple of
proprietary MQs. In the evaluation (and also in our production
environment), we peered a number of brokers (of the same MQ,
obviously) for the purpose of HA. AMQ failed miserably, particularly in
scenarios involving stress or errors (e.g., pulling the network
cable). It fails so badly that you won't believe it is from Apache at
first.

One thing worth highlighting: the evaluation was performed by a team
who had been operating AMQ in production for more than 6 months. They
may not be the most knowledgeable people out there regarding AMQ, but they
sure know a thing or two.

AMQ MAY work, if you don't need HA (peering MQs into a cluster).

I am particularly intrigued by the quality of its source code. (I've
blogged about its C++ client library in the past
http://fsfoundry.org/codefreak/2009/03/08/code-review-activemq-cpp/).
I've witnessed a committer check in a fix for a CRITICAL bug that
existed in a major release with no commit log. No matter how you look
at it, it is BAD.

Flee, people. Get the hell away from AMQ.

Best,
- fr3@K

michael

Aug 29, 2010, 9:58:13 PM
to CloudTW
I just wanted to respond to fr3@k's comments, and make some
clarifications.

> Don't use AMQ; it is way over-promoted and overrated. My team had used it
> in production for quite a while, and got bitten very badly.

AMQP is a protocol specification. People are excited about that,
because it is a well designed protocol that simplifies messaging at
the application. For example, applications using AMQP no longer have
to worry about the raw data coming out of sockets, and only about the
actual messages themselves.
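
To make "the raw data coming out of sockets" concrete, here is a minimal sketch of the bookkeeping an application has to do by hand when all it has is a byte stream. The 4-byte length prefix is an assumption made for illustration and is not the AMQP wire format; `FrameDecoder` and `encode` are hypothetical names. A messaging layer hides this kind of reassembly and hands the application whole messages.

```python
import struct


def encode(message):
    """Prefix a message (bytes) with its length as a 4-byte big-endian integer."""
    return struct.pack(">I", len(message)) + message


class FrameDecoder:
    """Reassembles length-prefixed messages from arbitrarily split byte chunks."""

    def __init__(self):
        self._buffer = b""

    def feed(self, chunk):
        """Add raw bytes; return the complete messages that are now available."""
        self._buffer += chunk
        messages = []
        while len(self._buffer) >= 4:
            (length,) = struct.unpack(">I", self._buffer[:4])
            if len(self._buffer) < 4 + length:
                break  # the body of this message has not fully arrived yet
            messages.append(self._buffer[4:4 + length])
            self._buffer = self._buffer[4 + length:]
        return messages
```

A socket may deliver the bytes of two messages in any split, for example 3 bytes, then 9, then the rest. The decoder returns nothing for the first chunk, the first message after the second chunk, and the second message after the third.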

> We've done a comprehensive evaluation of AMQ, Open MQ and a couple of
> proprietary MQs. In the evaluation (and also in our production
> environment), we peered a number of brokers (of the same MQ,
> obviously) for the purpose of HA. AMQ failed miserably, particularly in
> scenarios involving stress or errors (e.g., pulling the network
> cable). It fails so badly that you won't believe it is from Apache at
> first.

ApacheMQ, OpenMQ, OpenAMQ, Qpid, and RabbitMQ are just some of the
(more widely known) implementations of the protocol AMQP. Not all
implementations (of any design) are well implemented, but it doesn't
mean that if one (or a few) are badly implemented, it's a bad idea.
True, there is no built in High Availability (HA) in the AMQP
implementations, but HA is an orthogonal capability to messaging. As
a design decision, HA should be a comprehensive strategy devised for a
particular deployment, and not something to be added piecemeal to
individual applications by developers who have no idea how their
application will be used.

In a previous life at DreamWorks Animation, we were able to work with
RedHat on their MRG product (http://www.redhat.com/mrg/) which
incorporates AMQP coupled with HA capabilities that MRG also
provides. That scaled to thousands of machines over multiple
renderfarms (I can't give specific numbers) worldwide.

So, my opinion differs from fr3@k's. AMQP can be a
powerful and useful tool, but only if deployed wisely and used well.

Cheers
Mike

Jazz Yao-Tsung Wang

Aug 29, 2010, 10:10:29 PM
to CloudTW
Thanks for fr3@k's comments. I would also like to follow up on michael's
point: AMQP is an open standard that originated in the financial industry.

In my last post I only used ActiveMQ as an example. I don't like it
either; it's too complex for me :P I personally tried ZeroMQ once.
Judging by Google Trends, RabbitMQ might be the best choice.

- Jazz

fr3@K

Aug 30, 2010, 12:59:12 AM
to clo...@googlegroups.com

I used the term "AMQ" as shorthand for ActiveMQ (an implementation),
not AMQP (a specification).

Best,
- fr3@K

fr3@K

Aug 30, 2010, 1:02:59 AM
to clo...@googlegroups.com

Red Hat doesn't service MRG in Asia; we tried.

Best,
- fr3@K

Ethan Yin-Hao Tsui

Aug 30, 2010, 10:56:31 PM
to CloudTW
Hi, James,
I'm really looking forward to hearing your takeaways when you get back...
I'd love to go too... ><"


On Aug 29, 9:03 pm, James Chen <chaoyu0...@gmail.com> wrote:
> Thanks Michael, Pofeng,
>
> Sorry that my daughter kept making a fuss today.
>
> I plan to attend ApacheCon NA 2010 on Nov 1-5. http://na.apachecon.com/c/acna2010/

Ethan Yin-Hao Tsui

Aug 30, 2010, 11:04:35 PM
to CloudTW
Hi Jazz,
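Thanks for suggesting AMQ as a possible approach; I will look into it further.
At the moment I have started planning a rewrite of the Hadoop source code to get the features I need.
I don't know yet whether it will succeed; I can only do my best and try to implement a prototype.
I will try to minimize parameter passing as much as possible. We are still in the architecture planning stage, and development should start within the next week or two.
We will keep HDFS, the Hadoop configuration, and job scheduling (which we may rewrite ourselves if necessary). The rest
(TaskTracker, JobTracker, JobClient, ...) I will remove for now, along with the shell-compatibility layer and the
message logging. The goal is simply to be "fast".
I will share any good results with everyone.
Thanks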

Ethan

On Aug 30, 3:17 am, Jazz Yao-Tsung Wang <jazzwang...@gmail.com> wrote:
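> Sorry I couldn't attend the meetup. Here is some information I know of:
>
> >  * cclien: storage ?
>
> Current solutions for storage virtualization include Ceph, Lustre, and GPFS (IBM has released a GPL version).
> That said, I would suggest also considering replication support and deduplication support.
> File systems that currently support deduplication include (1) ZFS, (2) lessfs, and (3) SDFS.
>
> >  * Mr. 王堯坡: networking between VMs, layer-2 switch ?
>
> Take a look at Open vSwitch, which was just released this year - http://openvswitch.org/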

Ethan Yin-Hao Tsui

Aug 30, 2010, 11:07:18 PM
to CloudTW
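Thanks Fr3@k, Michael, and Jazz
for the insightful explanations and comparisons of AMQ...
I'm not familiar with this area and still need to read through the relevant documentation.
If there are any, could you recommend good write-ups, tutorials, and so on, so I can get up to speed faster... (so that I can at least see your taillights through a telescope...)
Thanks,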

Ethan

pingooo

Aug 31, 2010, 12:35:10 AM
to CloudTW
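<hearsay>
If speed is the goal, have you considered looking into Sector/Sphere?
http://sector.sourceforge.net/

I haven't used it, but Sector/Sphere, which is written in C++, is said to be considerably faster than Hadoop.
</hearsay>

If any experts on the list have used it, please share your experience.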

李柏鋒 (Pofeng Lee)

Aug 31, 2010, 12:45:50 AM
to clo...@googlegroups.com
Yes, according to their data it looks about 2-4 times faster.

http://sector.sourceforge.net/benchmark.html

To Ethan: I know of three MapReduce implementations, for your reference:

Hadoop (Yahoo) http://hadoop.apache.org/ Java
Sphere (UIC) http://sector.sourceforge.net/ C++
Disco (Nokia) http://discoproject.org/ Erlang

2010/8/31 pingooo <ping.n...@gmail.com>

Jazz Yao-Tsung Wang

Sep 1, 2010, 12:08:00 AM
to CloudTW
> To Ethan: I know of three MapReduce implementations, for your reference:
>
> Hadoop (Yahoo) http://hadoop.apache.org/ Java
> Sphere (UIC) http://sector.sourceforge.net/ C++
> Disco (Nokia) http://discoproject.org/ Erlang

I have put together a list of MapReduce implementations in different languages:
http://trac.nchc.org.tw/grid/wiki/jazz/09-04-14#MapReduce

- Jazz

* R
 * The R-Project and Map Reduce
 * http://ml.stat.purdue.edu/rhipe/ - Wow!! RHIPE - R and Hadoop Integrated Processing v.0.1. Combining these two really fits our current direction!!!
 * http://cran.r-project.org/web/packages/mapReduce/ - the official R mapReduce package: mapReduce - flexible mapReduce algorithm for parallel computation
 * Even more surprising, Amazon Web Services also supports R!!
   "Develop your data processing application authored in your choice of Java, Ruby, Perl, Python, PHP, R, or C++."
* Java
 * GridGain - a MapReduce framework written in Java
 * Hive - built on top of Hadoop; a project led by Facebook
 * Cloud MapReduce - A MapReduce implementation on Amazon Cloud OS
* C/C++
 * Phoenix
  * Talk uploaded on 2007/3/1 - Evaluating MapReduce for Multi-core and Multiprocessor Systems
  * Slides
  * Talk video
 * Galago TupleFlow
* Perl
 * Parallel::MapReduce
 * PlasmaFS - implements the map/reduce framework on a compute cluster
* Python
 * FileMap - source code
 * Disco - the core is written in Erlang; job management can be written in Python.
 * dumbo - very closely tied to Hadoop, since this project is the Python implementation for Hadoop Streaming ...
 * Prince - API for Hadoop/MapReduce in Python, 2010 (2010-05-12)
 * octopy - Easy MapReduce for Python (2010-08-24)
 * httpmr - A scalable data processing framework for people with web clusters. (2010-08-24) - runs on top of Google App Engine
 * misco - A Mobile MapReduce Framework
* Ruby
 * Skynet
 * Starfish - Open source Ruby implementation
 * mapredus - simple mapreduce framework using redis and resque (2010-08-24)
* Erlang
 * Riak : An Open Source Internet-Scale Data Store
* CUDA
 * Mars - A MapReduce Framework on Graphics Processors - if you want to run MapReduce on GPUs, Mars is an option
* Qt
 * QtConcurrent
  * Open Source C++ MapReduce (non-distributed) implementation from Trolltech
  * The web page says it is suited to shared-memory (non-distributed) systems.
* bash
 * Mapreduce Bash Script - MapReduce written as a bash shell script - source code
* JavaScript
 * Collaborative Map-Reduce in the Browser - the spirit this implementation promotes is somewhat like SETI@Home: using the power of the crowd to build a distributed cluster with HTTP as the standard.
* .NET
 * Qizmt - MySpace just released a MapReduce framework for .NET called Qizmt as an open source project. - intro video - source download
 * Dryad - DryadLINQ (2010-08-24)
 * http://mapsharp.codeplex.com/ (2010-08-31)
 * http://code.google.com/p/hadoopdotnet/ (2010-08-31)
* MPI
 * MapReduce-MPI Library - (2010-05-16: MPI-based MapReduce Implementation)
* MySQL
 * Gearman - Map/Reduce and Queues for MySQL Using Gearman (Video)

* http://mapreduce.net/