Hypertable architecture


HyperUser

May 24, 2013, 5:33:19 PM
to hyperta...@googlegroups.com
I am creating a simple setup.

4 machines, easy enough to manage manually, so I didn't use Capistrano.

SYS-01 - 4 CPUs (8 cores each), 128 GB RAM
Hadoop HDFS - namenode, datanode, jobtracker, tasktracker
Hypertable - Master, Hyperspace, RangeServer(?)

SYS-02 - 4 CPUs (8 cores each), 128 GB RAM
Hadoop HDFS - secondary namenode, datanode 
Hypertable - Hyperspace

SYS-03 - 4 CPUs (8 cores each), 128 GB RAM
Hadoop HDFS - datanode
Hypertable - Hyperspace

SYS-04 - 4 CPUs (8 cores each), 128 GB RAM
Hadoop HDFS - datanode
Hypertable - Hyperspace


The objective is to run clustered Hypertable on all four systems. SYS-01 is the Hypertable master. All four servers run a single database that is shared across all of them. I am not looking for any redundancy of data.

Which Hypertable processes should I run on the other servers?
Do I need a RangeServer? Which servers should I run it on?
Can I use each machine for read/write queries, or do I have to go through the master?
Do I need to run a ThriftBroker on all servers?




HdfsBroker.fs.default.name=hdfs://SYS-01:9000
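
(For reference: in a multi-node setup the same hypertable.cfg would normally also name the Hyperspace replica hosts, one line per replica. A minimal sketch assuming the host names from this thread; verify the property names against the Hypertable configuration reference for your version:)

# /opt/hypertable/0.9.7.6/conf/hypertable.cfg -- sketch
HdfsBroker.fs.default.name=hdfs://SYS-01:9000
# one line per Hyperspace replica
Hyperspace.Replica.Host=SYS-01
Hyperspace.Replica.Host=SYS-02
Hyperspace.Replica.Host=SYS-03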



root@SYS-01:/opt/hypertable/0.9.7.6/bin# ./start-all-servers.sh hadoop
DFS broker: available file descriptors: 1024
Started DFS Broker (hadoop)
Started Hyperspace
Started Hypertable.Master
/proc/sys/vm/swappiness = 60
Started Hypertable.RangeServer
Started ThriftBroker
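
(The log above comes from start-all-servers.sh, which starts every service on SYS-01. On nodes that should not run the full stack, the per-service scripts in the same bin directory can start just what is needed; a sketch, assuming the standard scripts shipped with 0.9.7.6:)

# On SYS-02 .. SYS-04 -- start only the per-node services (sketch)
/opt/hypertable/0.9.7.6/bin/start-dfsbroker.sh hadoop
/opt/hypertable/0.9.7.6/bin/start-hyperspace.sh
/opt/hypertable/0.9.7.6/bin/start-rangeserver.sh
/opt/hypertable/0.9.7.6/bin/start-thriftbroker.sh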




Capfile (though it's not in use):

set :source_machine, "SYS-01"
set :install_dir,  "/opt/hypertable"
set :hypertable_version, "0.9.7.6"
set :default_pkg, "/tmp/hypertable-0.9.7.6-linux-x86_64.deb"
set :default_dfs, "hadoop"
set :default_distro, "cdh3"
set :default_config, "/opt/hypertable/dev-hypertable.cfg"

role :source, "SYS-01"
role :master, "SYS-01"
role :hyperspace, "SYS-01", "SYS-02", "SYS-03", "SYS-04"
role :slave,  "SYS-01", "SYS-02", "SYS-03", "SYS-04"
role :localhost, "SYS-01"
role :thriftbroker_additional, "SYS-01", "SYS-02", "SYS-03", "SYS-04"
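
(If the Capfile were in use, deployment would typically be driven by the Capistrano tasks defined in the Capfile that ships with Hypertable, roughly as below; the task names here are assumptions, so check them against your copy of the Capfile:)

# Sketch -- typical deployment flow with Hypertable's Capfile (task names assumed)
cap install_package   # push the .deb named in :default_pkg to every role
cap dist              # distribute the installation to all machines
cap start             # start Hyperspace, master, RangeServers, ThriftBrokers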



root@SYS-01:/hyperspace# sudo -u hdfs hadoop dfsadmin -report
Configured Capacity: 2377018916864 (2.16 TB)
Present Capacity: 2174557945856 (1.98 TB)
DFS Remaining: 2174556585984 (1.98 TB)
DFS Used: 1359872 (1.3 MB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 4 (4 total, 0 dead)

Decommission Status : Normal
Configured Capacity: 594254729216 (553.44 GB)
DFS Used: 65536 (64 KB)
Non DFS Used: 50588594176 (47.11 GB)
DFS Remaining: 543666069504(506.33 GB)
DFS Used%: 0%
DFS Remaining%: 91.49%
Last contact: Fri May 24 17:23:18 EDT 2013


Decommission Status : Normal
Configured Capacity: 594254729216 (553.44 GB)
DFS Used: 73728 (72 KB)
Non DFS Used: 50589900800 (47.12 GB)
DFS Remaining: 543664754688(506.33 GB)
DFS Used%: 0%
DFS Remaining%: 91.49%
Last contact: Fri May 24 17:23:17 EDT 2013


Decommission Status : Normal
Configured Capacity: 594254729216 (553.44 GB)
DFS Used: 45056 (44 KB)
Non DFS Used: 50589016064 (47.11 GB)
DFS Remaining: 543665668096(506.33 GB)
DFS Used%: 0%
DFS Remaining%: 91.49%
Last contact: Fri May 24 17:23:18 EDT 2013


Decommission Status : Normal
Configured Capacity: 594254729216 (553.44 GB)
DFS Used: 1175552 (1.12 MB)
Non DFS Used: 50693459968 (47.21 GB)
DFS Remaining: 543560093696(506.23 GB)
DFS Used%: 0%
DFS Remaining%: 91.47%
Last contact: Fri May 24 17:23:17 EDT 2013


HyperUser

May 24, 2013, 5:39:00 PM
to hyperta...@googlegroups.com
Here is a picture of this.
[Attachment: HS-sys.jpg]

ddorian

May 25, 2013, 5:57:12 AM
to hyperta...@googlegroups.com
> Which Hypertable processes should I run on the other servers?
On each datanode you must have a RangeServer.

> Do I need a RangeServer? Which servers should I run it on?
See above. When you query, the data comes from the RangeServers; the same goes for writes.

> Can I use each machine for read/write queries, or do I have to go through the master?
The ThriftBroker knows which server to query based on the row key. It gets this information (where each range lives) from the master. (A minimal client sketch follows at the end of this message.)

> Do I need to run a ThriftBroker on all servers?
Run a ThriftBroker on each application server.

> I am not looking for any redundancy of data.
You can control that in the configuration file, at table creation, etc.
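
(To illustrate the ThriftBroker answer above: a minimal sketch using Hypertable's Python Thrift bindings. The port (15867) and the call names follow the 0.9.x client examples, but treat them as assumptions and verify against your installation:)

# Sketch: the application talks to the ThriftBroker on its own machine;
# the broker locates the right RangeServer for each row key.
from hypertable.thriftclient import ThriftClient

client = ThriftClient("localhost", 15867)   # default ThriftBroker port per 0.9.x docs
ns = client.namespace_open("/")             # open the root namespace
res = client.hql_query(ns, "SHOW TABLES")   # broker routes the query for us
print(res.results)                          # HQL results come back as strings
client.namespace_close(ns)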

HyperUser

May 25, 2013, 7:23:55 PM
to hyperta...@googlegroups.com
In conclusion, I will run these processes:
SYS-01 - master, datanode/Hyperspace, ThriftBroker and RangeServer
SYS-02 - datanode/Hyperspace, ThriftBroker and RangeServer
SYS-03 - datanode/Hyperspace, ThriftBroker and RangeServer
SYS-04 - datanode/Hyperspace, ThriftBroker and RangeServer

All servers will be used to add and read data for an external application.

ddorian

May 26, 2013, 7:20:55 AM
to hyperta...@googlegroups.com
Why 4 Hyperspace replicas? 3 are enough.

Which ThriftBroker does the application connect to?
Is the application on another server? (If so, install a ThriftBroker on that server and remove it from the others.)

As for running the master and a RangeServer together, I don't know whether that's a good idea or not.

Out of curiosity, what kind of disk setup do these servers have, or are you looking for in-memory data only?



HyperUser

May 26, 2013, 3:51:38 PM
to hyperta...@googlegroups.com
The application, DFS, and Hypertable will run on all four machines.
The application shall be able to use all four machines for read and write queries. I don't have a load balancer at this time, so I am load balancing manually by design, pointing application instances at individual servers' IPs.
All machines together form one DFS cluster of around 2 TB. Each machine has 500 GB in RAID 10 with 10k RPM drives.
All machines form one distributed database cluster (Hyperspace), with data stored in DFS across all servers: one Hyperspace across all servers, one namespace, and multiple tables that can be read and written from all machines. I am not looking for any replication or redundancy of data.

SYS-01 - master; DFS broker; Hyperspace process to join the cluster; DFS datanode, hence a RangeServer process; ThriftBroker process for reads/writes
SYS-02 - DFS broker; Hyperspace process to join the cluster; DFS datanode, hence a RangeServer process; ThriftBroker process for reads/writes
SYS-03 - DFS broker; Hyperspace process to join the cluster; DFS datanode, hence a RangeServer process; ThriftBroker process for reads/writes
SYS-04 - DFS broker; Hyperspace process to join the cluster; DFS datanode, hence a RangeServer process; ThriftBroker process for reads/writes

Does it sound right? 

dorian i

May 26, 2013, 4:05:29 PM
to hyperta...@googlegroups.com
So the application connects to the ThriftBroker on localhost?
Usually RAID is not used with Hypertable/HBase/HDFS/Cassandra. And RAID is itself a kind of redundancy and availability.
If you remove RAID and set replication=2, you can recover the data even after losing a machine. If you just remove the RAID, you will get a faster cluster.
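
(For reference, the cluster-wide HDFS default replication is set in hdfs-site.xml; a minimal sketch, with Hypertable's per-table option discussed later in the thread as the finer-grained alternative:)

<!-- hdfs-site.xml (sketch): cluster-wide default block replication -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>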




HyperUser

May 26, 2013, 4:16:53 PM
to hyperta...@googlegroups.com
Yes, the applications run locally on all four machines along with Hadoop DFS and Hypertable. For writing data I manually use the local IP to avoid delay. For reading I will again use the local IP, but it will still see data written from the other machines since it is a single, shared database.

How do I use replication? That is interesting; I am not particular about using RAID. In fact, if I avoid RAID 10, it will actually double my usable disk space. I chose RAID 10 since it has good read/write performance. I could do RAID 0, as just one flat drive, which also has good read/write performance.

dorian i

May 26, 2013, 4:29:45 PM
to hyperta...@googlegroups.com
Don't do RAID at all. You instruct Hypertable on the replication level: via a cluster-wide default (in the config file; I think the default is 3), at table creation (see the HQL reference page), or per access group (same as at table creation).
Hypertable tells HDFS to replicate the data, and HDFS does it.
The easiest option is at table creation time (a sketch follows below). Also, don't do RAID 0 (if you lose one disk, you lose the data on all disks).
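
(A minimal HQL sketch of the table-creation option. REPLICATION appears in the HQL reference as a storage option, but the table name and column here are made up, and the exact option placement should be verified against the HQL reference for your version:)

-- Sketch: ask HDFS for 2 replicas of this table's files (hypothetical table)
CREATE TABLE REPLICATION=2 metrics (
  value
);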

HyperUser

May 26, 2013, 4:39:31 PM
to hyperta...@googlegroups.com
Hmmm. The server hardware has a RAID controller; I have to create a RAID 0, 1, 5, or 10 volume. If I make RAID 0, each machine will have a single drive of 1.2 TB, which will give me more usable space. It won't let me use individual disks, so I can't avoid RAID.

My application will have a lot of read/write load, and I don't want to lose performance at the disk level. I would prefer some replication in Hypertable if I skip RAID 10.

dorian i

May 27, 2013, 10:45:39 AM
to hyperta...@googlegroups.com
There has to be a way to disable the RAID controller.
If not, then don't use RAID 5.
Use RAID 0 only if losing the whole machine's data when one disk fails is absolutely not a problem.
Use RAID 1 for mirroring, or RAID 10 for mirroring plus striping.