Questions


Raj

unread,
Jun 25, 2013, 11:00:28 PM
to hadooponli...@googlegroups.com
1. How can I view how many blocks a file has been broken into in a Hadoop file system?
2. What is the difference between Scale out and Scale up?
3. What is batch processing in Hadoop?
4. What is OLAP and OLTP?

pavan.irmc

unread,
Jun 26, 2013, 2:06:06 PM
to hadooponli...@googlegroups.com


On Wednesday, June 26, 2013 8:30:28 AM UTC+5:30, Raj wrote:
1. How can I view how many blocks a file has been broken into in a Hadoop file system?

The dfs.data.dir setting (default: ${hadoop.tmp.dir}/dfs/data, documented in hdfs-default.xml) only tells you where block files are stored on each DataNode's local disk. To see how many blocks a particular file has been split into, run hadoop fsck on that file with the -files and -blocks options.
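A minimal sketch (the path is a placeholder; assumes a running cluster and the `hadoop` command on your PATH):

```shell
# Report the file's length, block count, block IDs, and which
# DataNodes hold each replica. /user/raj/input.txt is hypothetical.
hadoop fsck /user/raj/input.txt -files -blocks -locations
```

fsck prints one line per block, so counting the listed blocks answers the question directly.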

 
2. What is the difference between Scale out and Scale up?

Scale up: adding computational power to a single computer, i.e. buying a bigger machine.
Scale out: adding more machines to the cluster and distributing the load across them; this is the approach Hadoop takes.
 
3. What is batch processing in Hadoop?

Offline processing: jobs run periodically over large volumes of already-collected data, rather than interactively on individual records.

 
4. What is OLAP and OLTP?

Two of the most commonly misunderstood terms are OLTP and OLAP. OLTP means Online Transaction Processing; OLAP means Online Analytical Processing. The meanings are synonymous with their names. OLTP deals with processing of data from transactional systems. For example, an application that loads the reservation data of a hotel is an OLTP system. An OLTP system is designed mainly keeping in mind the performance of the end application. It comprises the application, the database, and the reporting system that works directly on this database. The database in an OLTP system is designed to improve application efficiency, thereby reducing the processing time of the application. For example, consider a hotel reservation application. Its database would be designed mainly for fast inserts of customer-related data, and for fast retrieval of room-availability information.

            Such a database is part of an OLTP system. Whenever a reporting tool works on such an application database, together they form the OLTP system. Generally, an OLTP system refers to the type of database a reporting tool works on. During the 1980s, many applications were developed to cater to the needs of upcoming organizations. All of them required a database to process, load, and extract data, and the entire database was designed with the performance of the end application in mind. As organizations grew, they felt the importance of analyzing the data they had collected. The analysis performed on this data yielded a striking number of findings that helped organizations make important business decisions. Hence, a need was felt to develop full reporting solutions, and this is the period when more and more reporting tools came onto the market. But the performance of these reporting tools was very poor, since they had to extract data from a system/database that had been developed mainly with application performance in mind.

            There were two main reasons why a data warehouse (DW) came into being:

1. The performance of front-end applications decreased as more and more data was collected, so a need was felt to isolate older data.

2. The importance of reporting was felt equally across organizations, but existing reporting systems performed poorly because they had to work on live application databases.

 

OLAP systems were mainly developed using data in a warehouse. Since a need was felt to isolate older data, it was necessary to store it in a format that would ease the reporting bottlenecks: the data was isolated and restructured so that this repository became the prime source of business decisions. OLAP systems were built on this isolated data, which provided a means for faster, easier, and more efficient reporting. An OLAP system need not always report from a DW; the criterion is that it reports from a system/database that does not involve ongoing transactions. For example, some organizations create an ODS (Operational Data Store), a replica of the transactional data, which is then used for reporting. But generally OLAP is synonymous with a data warehouse.

  

Raj

unread,
Jun 26, 2013, 11:33:38 PM
to hadooponli...@googlegroups.com
What is a sequence file?

Regards,
Raj

satish Edhara

unread,
Jun 27, 2013, 12:12:36 AM
to Raj, hadooponli...@googlegroups.com

A SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format.

Note: internally, the temporary outputs of maps are stored using SequenceFile.

SequenceFile provides Writer, Reader, and Sorter classes for writing, reading, and sorting respectively.

There are 3 different SequenceFile formats:

  1. Uncompressed key/value records.
  2. Record compressed key/value records - only 'values' are compressed here.
  3. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. Note: the size of the 'block' is configurable.

The recommended way is to use the SequenceFile.createWriter methods to construct the 'preferred' writer implementation.

The SequenceFile.Reader acts as a bridge and can read any of the above SequenceFile formats.

Example: you can read one in the following manner:


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
  // perform some operation on each key/value pair
}
reader.close();

Ref - http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html



--
You received this message because you are subscribed to the Google Groups "HadoopOnlineTraining9" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hadooponlinetrai...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

PAVAN KUMAR Reddy

unread,
Jun 27, 2013, 2:48:00 AM
to satish Edhara, Raj, hadooponli...@googlegroups.com
Great work Satish & Raj, kudos!

Way to go... :)

Thanks & Regards
Pavan Reddy.G
--

Thanks & Regards
Pavan Reddy .G

Raj

unread,
Jun 27, 2013, 11:16:07 PM
to hadooponli...@googlegroups.com, satish Edhara, Raj
What is Streaming data access?
What is NameSpace?
What is the maximum number of requests a DataNode can serve (max number of threads)?

Regards,
Raj

ajay m

unread,
Jun 29, 2013, 8:58:20 PM
to hadooponli...@googlegroups.com, satish Edhara, Raj
What is Streaming data access?

As HDFS works on the principle of 'Write Once, Read Many', the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data as on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data set is more important than the time taken to fetch a single record from it.

What is NameSpace?

HDFS is based on an architecture where the namespace is decoupled from the data. The namespace forms the file system metadata, which is maintained by a dedicated server called the name-node. The data itself resides on other servers called data-nodes.


What is the maximum number of requests a DataNode can serve (max number of threads)?

This is governed by the maximum size of the DataNode's thread pool. When the pending request queue overflows, new threads are created until their number reaches this maximum; after that, the server starts dropping connections.

Default: 1000
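As a sketch, this limit can be raised in hdfs-site.xml. The property name has varied across releases (dfs.datanode.max.xcievers in older 1.x releases, dfs.datanode.max.transfer.threads later), so check your version's documentation; the value below is illustrative, not a recommendation:

```xml
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <!-- illustrative value; tune for your workload -->
  <value>4096</value>
</property>
```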


Best Regards,
Ajay

Raj

unread,
Jul 3, 2013, 5:37:46 PM
to hadooponli...@googlegroups.com
Hi Pavan,

What is Data Intensive?
What is Computation Intensive?

Regards,
Raj

Raj

unread,
Jul 3, 2013, 6:10:57 PM
to hadooponli...@googlegroups.com
Hi Pawan

Questions on Installation Mode
1. What is Local Mode?
2. What is Pseudo Distributed Mode?
3. What is Distributed Mode?

Regards,
Raj

preem r

unread,
Jul 3, 2013, 6:29:52 PM
to Raj, hadooponli...@googlegroups.com
Hi Pavan,

How do I set up HDFS on a personal computer, with cluster examples, instead of using the Cloudera image?

Thanks,
Aruna



PAVAN KUMAR Reddy

unread,
Jul 3, 2013, 8:17:13 PM
to preem r, Raj, hadooponli...@googlegroups.com
Will explain when we get to the cluster-setup concepts.

ajay m

unread,
Jul 4, 2013, 3:52:24 PM
to hadooponli...@googlegroups.com
What is Computation Intensive?
Compute intensive: maximizes system resources for processing large computations, such as simulations.

What is Data Intensive?
Data intensive: simplifies the challenges of working with large datasets, such as those generated by sensors and simulation results. The goal is to move the computation to the data.

Best Regards,
Ajay

ajay m

unread,
Jul 4, 2013, 4:00:20 PM
to hadooponli...@googlegroups.com
1. What is Local Mode?
This is the default mode you get when downloading and extracting Hadoop for the first time. In this mode, Hadoop does not use HDFS to store input and output files; it just uses the local filesystem. This mode is very useful for debugging your MapReduce code before you deploy it on a large cluster to handle huge amounts of data. In this mode, Hadoop's configuration file triplet (mapred-site.xml, core-site.xml, hdfs-site.xml) remains free of custom configuration.


2. What is Pseudo Distributed Mode?
A pseudo-distributed mode is simply a distributed mode run on a single host.
In this mode, we configure the configuration triplet to run on a single-node cluster. The HDFS replication factor is one, because a single node acts as master node, DataNode, JobTracker, and TaskTracker. We can use this mode to test our code against real HDFS without the complexity of a fully distributed cluster.
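For a Hadoop 1.x-era setup, a pseudo-distributed configuration might look like the following sketch (hostname and port are illustrative placeholders):

```xml
<!-- core-site.xml: point the default filesystem at a local NameNode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- hdfs-site.xml: single node, so keep only one replica per block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```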

3. What is Distributed Mode?
Distributed mode can be subdivided into pseudo-distributed, where all daemons run on a single node, and fully distributed, where the daemons are spread across all nodes in the cluster.
In fully distributed mode, we use Hadoop at its full scale: a cluster can consist of a thousand nodes working together. This is the production phase, where your code and data are distributed across many nodes. You use this mode once your code is ready and works properly in the previous modes.


Best Regards,
Ajay

ajay m

unread,
Jul 4, 2013, 4:12:39 PM
to hadooponli...@googlegroups.com, Raj
The links below provide detailed instructions on setting up Hadoop.

First step: install Ubuntu using VMware Player. The link below explains this in detail.

http://theholmesoffice.com/installing-ubuntu-in-vmware-player-on-windows/

Second step: Hadoop on Ubuntu Linux (single-node cluster). The link below explains this in detail.
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

PS: Skip the first step if you already have a Linux OS.

Best Regards,
Ajay

Raj

unread,
Jul 31, 2013, 1:35:46 AM
to hadooponli...@googlegroups.com, Raj
Hi Pavan,

Can you please tell me the procedure to enable trash in Hadoop, and how to set the retention time for that trash?

Regards,
Raj

PAVAN KUMAR Reddy

unread,
Jul 31, 2013, 11:40:39 AM
to Raj, hadooponli...@googlegroups.com
Hi all.

To enable the trash feature and set the time delay for trash removal, set the
fs.trash.interval property in core-site.xml to the delay (in minutes). For example, if you want users to have 24 hours (1,440 minutes) to restore a deleted file, set fs.trash.interval to 1440 in core-site.xml.
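A sketch of the corresponding core-site.xml entry (1440 matches the 24-hour example above):

```xml
<property>
  <name>fs.trash.interval</name>
  <!-- minutes a deleted file stays in trash; 1440 = 24 hours -->
  <value>1440</value>
</property>
```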



Thanks & Regards
Pavan Reddy.G

Raj

unread,
Aug 14, 2013, 11:03:56 PM
to hadooponli...@googlegroups.com, Raj
Pawan, two questions

How indexing is done in HDFS?

If we want to copy 10 blocks from one machine to another, but the other machine has room for only 8.5 blocks, can the blocks be broken at the time of replication?

Thanks,
Jadhav Raj
PAVAN KUMAR Reddy

unread,
Aug 15, 2013, 11:40:56 AM
to Raj, hadooponli...@googlegroups.com
1: Indexing is done by the NameNode (by keeping the block information).

2: Blocks are never broken or split during replication; each block is replicated whole (64 MB by default).



PAVAN KUMAR Reddy

unread,
Aug 15, 2013, 12:12:17 PM
to Jadhav, hadooponli...@googlegroups.com
Hey

There is no page-ranking concept in HDFS.

Page ranking is an algorithm used by search engines to prioritize results.


On Thu, Aug 15, 2013 at 9:12 PM, Jadhav <trini...@gmail.com> wrote:
Sorry Pavan it is not indexing.... it is ranking

Raj

unread,
Aug 17, 2013, 9:30:29 PM
to hadooponli...@googlegroups.com, Jadhav
How does Hadoop read input from stdin and write to the destination filesystem?
What is the difference between -copyFromLocal and -put?
What is the maximum value the replication factor can have?
What is the difference between -touchz and -mkdir?
I am not able to run the command "bin/hadoop dfs -test -e/d/z /testfolder/file1/" - no output is coming. Can you please provide some scenario for testing?
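A note on the last question (a sketch; the path is a placeholder and assumes a running cluster): hadoop fs -test reports its result through the shell exit status rather than stdout, so "no output" is expected behavior.

```shell
# -e checks existence; the command prints nothing and sets the exit status.
hadoop fs -test -e /testfolder/file1
echo $?   # 0 if the path exists, non-zero otherwise
```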

Thanks,
Raj