Hadoop中国用户组 (Hadoop China User Group, CHUG)
On storing large numbers of small files in HDFS
panfei
Sep 27, 2012, 6:30:19 AM
to Hadoop中文用户组 (Hadoop Chinese User Group)
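I just want to pin down what the statement that HDFS is "suited to storing large files" actually covers. Below are a few of my own interpretations, which may not be correct; corrections are welcome:
1. "Suited to storing large files" should mean that each file ought to be at least one HDFS block in size. For example, if every file is smaller than the HDFS block size, each file occupies its own block, so with a replication factor of 3 the ratio of files to block replicas is 1:3. The NameNode's memory can then fill up with metadata while HDFS still has plenty of free space left. Would the NameNode easily become the bottleneck?
2. "Suited to storing large files" matters most for MapReduce: MR schedules computation block by block, so batch-processing one large file stream is far more efficient than processing a pile of small file streams.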
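To put rough numbers on point 1, here is a back-of-the-envelope sketch in Python. It relies on assumptions that do not come from this thread: the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block), a 64 MB block size, and 1 TiB of data. Replicas are not counted as separate objects in this model, and the function names are made up for illustration.

```python
import math

# Assumed rule of thumb (not a figure from this thread): each file and each
# block costs roughly 150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150

MB = 1024 * 1024
GB = 1024 * MB
BLOCK_SIZE = 64 * MB  # assumed HDFS block size


def namenode_objects(num_files, file_size, block_size=BLOCK_SIZE):
    """Count namespace objects: one per file plus one per block of that file."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return num_files * (1 + blocks_per_file)


def namenode_bytes(num_files, file_size, block_size=BLOCK_SIZE):
    """Estimate NameNode heap consumed by the metadata of these files."""
    return namenode_objects(num_files, file_size, block_size) * BYTES_PER_OBJECT


total_data = 1024 * GB  # the same 1 TiB stored two different ways

small_files = namenode_bytes(total_data // (1 * MB), 1 * MB)  # 1 MB files
large_files = namenode_bytes(total_data // (1 * GB), 1 * GB)  # 1 GB files

print(f"1 MB files: {small_files / MB:.1f} MB of NameNode heap")
print(f"1 GB files: {large_files / MB:.1f} MB of NameNode heap")
```

Under these assumptions the same amount of data costs the NameNode about 120 times more heap when it is stored as 1 MB files, even though the DataNodes use the same disk space either way.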
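From other material I have read, mogilefs (https://github.com/mogilefs/) may be better suited to storing large numbers of small files. Has anyone here looked into it? If so, could you share some comparisons with HDFS?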
--
Without studying, one cannot know
lake
Oct 15, 2012, 5:58:50 AM
to hado...@googlegroups.com
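There are actually quite a few ways to deal with small files:
1. Write an input format based on CombineFileInputFormat to stop the number of map tasks from exploding
2. JVM reuse
What these cannot fix is the namenode's resource usage: a small file takes up a namenode entry just as a large file does.
Merging the small files periodically takes care of that.
So at bottom HDFS handles small-file storage well enough and there is no need to worry. The key is knowing your tools.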
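The idea behind option 1 can be sketched in a few lines of Python. This is only an illustration of packing many small files into fewer input splits, not Hadoop's CombineFileInputFormat itself, which also takes data locality into account. The function name, the file names, and the 2 MB and 128 MB sizes are hypothetical.

```python
def combine_splits(files, max_split_size):
    """Greedily pack (name, size) pairs into splits of at most max_split_size bytes.

    Each returned split would be handled by one map task instead of one task
    per file. A single file larger than the limit still gets its own split.
    """
    splits, current, current_size = [], [], 0
    for name, size in files:
        # Close the current split if adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


MB = 1024 * 1024
# Hypothetical input: 1000 small files of 2 MB each.
files = [(f"part-{i:04d}", 2 * MB) for i in range(1000)]
splits = combine_splits(files, 128 * MB)

print(f"{len(files)} files -> {len(splits)} map tasks")
```

With one split per file this input would need 1000 map tasks; packed this way it needs 16. The namenode still holds 1000 file entries, which is why periodic merging is the separate remedy for that side of the problem.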
On Thursday, September 27, 2012 at 6:30:20 PM UTC+8, panfei wrote:
周梦想
Dec 20, 2012, 3:15:14 AM
to hado...@googlegroups.com
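HDFS suits large files precisely because there is only one namenode. If you store small files, the namenode's space may be exhausted, or it may become unable to manage them. Taobao's TFS seems to have improved on this; I have seen the related paper, and it is worth a look. I am not sure whether TFS is open source, though.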
mokeyj
Dec 24, 2012, 3:11:09 AM
to hado...@googlegroups.com
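TFS is already open source: http://code.taobao.org/p/tfs/wiki/intro/. TFS handles small files mainly by merging many of them into one large file (a block). Each small file has a unique fileID, each Block has a unique BlockId, and the NameServer only maintains the mapping between Blocks and DataServers. The mapping between a fileId and its Block is carried by the file name.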
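The scheme described above can be sketched as a toy in-memory model in Python. This is not TFS code: the class names mirror the terms in the post, but the per-block offset index, the `make_name`/`parse_name` helpers, and the `"<block_id>-<file_id>"` name format are assumptions made up for illustration, and the real TFS file name encoding is different.

```python
class DataServer:
    """Holds blocks; each block is many small files stored back to back."""

    def __init__(self):
        self.blocks = {}  # block_id -> bytearray of concatenated file contents
        self.index = {}   # (block_id, file_id) -> (offset, length) in the block

    def append(self, block_id, file_id, data):
        block = self.blocks.setdefault(block_id, bytearray())
        self.index[(block_id, file_id)] = (len(block), len(data))
        block.extend(data)

    def read(self, block_id, file_id):
        offset, length = self.index[(block_id, file_id)]
        return bytes(self.blocks[block_id][offset:offset + length])


class NameServer:
    """Tracks only which DataServer holds each block, never individual files."""

    def __init__(self):
        self.block_to_server = {}  # block_id -> DataServer


def make_name(block_id, file_id):
    # Hypothetical name format: the name itself carries both ids.
    return f"{block_id}-{file_id}"


def parse_name(name):
    block_id, file_id = name.split("-")
    return int(block_id), int(file_id)


def write_file(name_server, block_id, file_id, data):
    name_server.block_to_server[block_id].append(block_id, file_id, data)
    return make_name(block_id, file_id)


def read_file(name_server, name):
    block_id, file_id = parse_name(name)
    server = name_server.block_to_server[block_id]  # the only NameServer lookup
    return server.read(block_id, file_id)


ns = NameServer()
ns.block_to_server[7] = DataServer()
names = [write_file(ns, 7, i, f"payload-{i}".encode()) for i in range(1000)]

print(read_file(ns, names[42]))
print(len(ns.block_to_server), "NameServer entry for", len(names), "files")
```

The point of the design is visible in the last line: a thousand small files add a single entry to the NameServer, because the per-file lookup is pushed into the file name and the DataServer.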