
Reading a large file


smileso...@gmail.com

Apr 4, 2009, 12:07:03 AM
Hi all,
I have a very big file (around 4 GB) on a 32-bit machine. How can I
read the file faster? Is there any algorithm or source code available
for reading it faster, with proper memory management?

Regards
Soni

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Alexander Krizhanovsky

Apr 5, 2009, 5:30:42 AM
> I have a very big file (around 4 GB) on a 32-bit machine. How can I
> read the file faster? Is there any algorithm or source code available
> for reading it faster, with proper memory management?

Why not use mmap() if you run Linux/UNIX? Actually, you shouldn't need
to care about the size of the file (as long as your file system supports
files that large), because the operating system uses virtual memory and
the page cache; i.e. inactive pages are simply written back to the file
(the physical disk blocks). Note, however, that a 32-bit process cannot
map a whole 4 GB file at once, so you have to map it in smaller
windows... Otherwise you can use lseek(): off_t is a 64-bit type when
large-file support is enabled.
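For illustration, a minimal sketch of that windowed approach (the 64 MB
window size is arbitrary, error handling is reduced to early returns,
and it assumes compilation with -D_FILE_OFFSET_BITS=64 so that off_t is
64 bits):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch: scan a huge file in 64 MB mapped windows, since a 32-bit
   process cannot map all 4 GB at once. */
int scan_file(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    const off_t window = 64 * 1024 * 1024; /* multiple of the page size */
    for (off_t off = 0; off < st.st_size; off += window) {
        off_t rest = st.st_size - off;
        size_t len = (size_t)(rest < window ? rest : window);
        void* p = mmap(0, len, PROT_READ, MAP_SHARED, fd, off);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        /* ... process the len bytes starting at p ... */
        munmap(p, len);
    }
    close(fd);
    return 0;
}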

t...@42.sci.utah.edu

Apr 5, 2009, 5:30:42 AM
smileso...@gmail.com wrote:
> Hi all,
> I have a very big file (around 4 GB) on a 32-bit machine. How can I
> read the file faster? Is there any algorithm or source code available
> for reading it faster, with proper memory management?

The best way is probably to use mmap:

http://www.gnu.org/software/libtool/manual/libc/Memory_002dmapped-I_002fO.html

if you're on a GNU system. General concept:

http://en.wikipedia.org/wiki/Mmap

and finally, perhaps most importantly, the standard:

http://www.opengroup.org/onlinepubs/000095399/functions/mmap.html

Cheers,

-tom

Gil

Apr 5, 2009, 5:30:47 AM
On Apr 4, 12:07 am, smilesonisa...@gmail.com wrote:
> Hi all,
> I have a very big file (around 4 GB) on a 32-bit machine. How can I
> read the file faster? Is there any algorithm or source code available
> for reading it faster, with proper memory management?
>

Folks, incomplete questions usually get incomplete answers.

Reading large files is unfortunately still platform dependent.
Depending on your platform (that is, OS, compiler, and file system) you
may have different ways to enable and use LFS (large file support).
For example, on Solaris 5.8 with a 32-bit compiler you cannot enable
LFS iostream support; I believe it is the same on AIX 4.x.

For Win32 LFS, if you write non-portable code (that'd be my first guess
judging by your question), you'll probably end up using the WINAPI
ReadFile(Ex) together with SetFilePointer(Ex)/GetFileSize(Ex).
I haven't used VC++ in the past 5 years, so I might be wrong and newer
versions of MSVC might have truly optimized LFS iostreams in the 32-bit
compile model.

For the *nix world you can enable LFS by compiling with
-D_FILE_OFFSET_BITS=64. This should force all file-access calls to use
their 64-bit variants, so code using (f)open and (f)read can work with
LFS without explicitly calling the *64 variants (open64, lseek64, and
so on). Some platforms may require additional linker arguments.
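For example, a typical compile line on a GNU toolchain might look like
this (file names are placeholders):

    g++ -D_FILE_OFFSET_BITS=64 -o myprog myprog.cpp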

Several types may also change, for example off_t.
Obviously, avoid using int in place of off_t:

off_t curpos;
/* note the C cast; with LFS enabled, off_t is actually off64_t */
curpos = lseek(fd, (off_t)0, SEEK_CUR);

Now back to your question: the idea is to read large blocks from the
file at once (and optimize the reading per platform). The best solution
is still platform dependent, because the best possible way to map
portions of a file into physical memory is platform dependent.
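In the simplest portable form that might look like the following
sketch, which just reads fixed-size chunks with fread (the 1 MB chunk
size is arbitrary):

#include <cstddef>
#include <cstdio>
#include <vector>

// Sketch: read the file in 1 MB chunks. fread() delivers the bytes in
// file order, and the final short chunk is handled automatically.
void read_in_chunks(const char* path)
{
    std::FILE* f = std::fopen(path, "rb");
    if (!f)
        return;

    std::vector<unsigned char> buf(1024 * 1024);
    std::size_t n;
    while ((n = std::fread(&buf[0], 1, buf.size(), f)) > 0) {
        // ... process the n bytes in buf[0 .. n) ...
    }
    std::fclose(f);
}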

For platform-specific mapping there is mmap (not available on Windows;
the closest equivalent there is CreateFileMapping/MapViewOfFile):

void* mem = mmap(buf, count, PROT_READ, MAP_SHARED | MAP_FIXED,
                 mmap_file_des, offset);

So you'll have to identify the fastest way to read blocks from a file
on every platform you want to support.

There are also open-source libraries that cover most of these
platforms and support LFS; I'm not sure how well (or whether) they are
optimized for every platform.

Check out boost::filesystem::fstream, which should support LFS.
You should probably also check boost::iostreams::mapped_file, which
offers a flexible way to access files.
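For illustration, a window-mapping sketch with boost::iostreams (the
file name, offset, and window size are placeholders; error handling is
omitted):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <iostream>

int main()
{
    namespace io = boost::iostreams;

    // Map a 64 MB window of the file instead of the whole thing,
    // since a 32-bit process cannot address 4 GB at once.
    io::mapped_file_params params;
    params.path   = "big.dat";
    params.offset = 0;                // must be page-aligned
    params.length = 64 * 1024 * 1024; // bytes to map

    io::mapped_file_source file(params);
    const char* data = file.data();
    std::size_t size = file.size();

    // ... process data[0 .. size) ...
    std::cout << "mapped " << size << " bytes\n";
}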

I had to deal with this problem myself a couple of years ago when
implementing a paged n-dimensional matrix for NeoCircuit, and I ended
up writing my own read_file for the different platforms. So you may
also want to search for 'c++ paged containers' or 'c++ large paged
containers', because any good paged container has to optimize
reads/writes to large files.

Jack Klein

Apr 5, 2009, 4:07:56 PM
On Fri, 3 Apr 2009 22:07:03 CST, smileso...@gmail.com wrote in
comp.lang.c++.moderated:

> Hi all,
> I have a very big file (around 4 GB) on a 32-bit machine. How can I
> read the file faster? Is there any algorithm or source code available
> for reading it faster, with proper memory management?

I don't know, mainly because you haven't told us how you are reading
it now. If you are reading it one character at a time, there probably
are faster methods.

You would make it much easier for people to help you if you gave us
some idea of the following:

-- The contents of the file: is it text or binary? Fixed records or
different types of data?

-- How you are currently reading the file, otherwise how can anyone
possibly know whether or not you have already found the fastest method
possible?

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.club.cc.cmu.edu/~ajo/docs/FAQ-acllc.html

smileso...@gmail.com

Apr 8, 2009, 5:39:40 PM
On Apr 6, 1:07 am, Jack Klein <jackkl...@spamcop.net> wrote:
> On Fri, 3 Apr 2009 22:07:03 CST, smilesonisa...@gmail.com wrote in
> comp.lang.c++.moderated:

Hi all,
I am using VC++ on Windows. I was reading the file as binary data, one
byte at a time, into RAM. But that takes a lot of time when the file is
large (4 GB). If I instead read 32 bits (a word) at a time, store it,
and then read the next word, I will have problems assembling the 32-bit
data back into bytes.

For example:

union {
    word* wordPtr;  /* word = 32 bits */
    byte* bytePtr;  /* byte = 8 bits */
};

Let's say I have 101 bytes in the file; then 101/4 = 25 word reads, and
the remaining 1 byte can be copied separately. But I want the result to
be in bytes. What is the fastest way to tackle this problem?

Regards
Soni



smileso...@gmail.com

Apr 15, 2009, 11:57:28 AM
{ quoted signature and banner redacted. -mod }

On Apr 6, 1:07 am, Jack Klein <jackkl...@spamcop.net> wrote:

> On Fri, 3 Apr 2009 22:07:03 CST, smilesonisa...@gmail.com wrote in
> comp.lang.c++.moderated:
>
> > Hi all,
> > I have a very big file (around 4 GB) on a 32-bit machine. How can I
> > read the file faster? Is there any algorithm or source code available
> > for reading it faster, with proper memory management?


>
> I don't know, mainly because you haven't told us how you are reading
> it now. If you are reading it one character at a time, there probably
> are faster methods.

Hi,
Hi,
I am reading each character one at a time. The file is an ASCII
file... Each octet is read into a buffer. Could you please suggest
a better method?


>
> You would make it much easier for people to help you if you gave us
> some idea of the following:
>

> -- The contents of the file: is it text or binary? Fixed records or
> different types of data?

All are fixed records...
>
> -- How you are currently reading the file, otherwise how can anyone
> possibly know whether or not you have already found the fastest method
> possible?
>

I am reading each character at a time.

--

Frank Birbacher

Apr 27, 2009, 11:39:10 AM
Hi!

I wonder why there has been no answer in the last few weeks. I just
stumbled over this thread.

smileso...@gmail.com wrote:


> Hi,
> I am reading each character one at a time. The file is an ASCII
> file... Each octet is read into a buffer. Could you please suggest
> a better method?

Instead of reading one character at a time, you can use the "read"
member function to read any number of characters at once. Since you
have a fixed record length, that length is a good candidate for the
read size.

Reading a file using the C++ standard iostreams:

#include <cstddef>   //size_t
#include <vector>
#include <istream>   //for reading
#include <iostream>  //console streams, cout
#include <fstream>   //for files
#include <stdexcept> //some exceptions, runtime_error

static const std::size_t RECORD_SIZE = 100; //fixed size record length

typedef std::vector<char> RecordBuffer; //array of char

void handleRecord(RecordBuffer const& buffer)
{
    // .. whatever ..
    // access to chars via buffer.at(index)
    std::cout << "read one record\n";
}

void readRecord(std::istream& stream, RecordBuffer& buffer)
{
    buffer.resize(RECORD_SIZE);
    stream.read(&buffer.front(), RECORD_SIZE);
}

void readAndHandleFile(const char* const filename)
{
    std::ifstream file(filename);
    if( ! file)
        throw std::runtime_error("could not open file");

    while(true)
    {
        RecordBuffer buffer;
        readRecord(file, buffer);
        if( ! file) //stop on EOF or a short (partial) final record
            break;
        handleRecord(buffer);
    }
    file.close();
}

int main(int argc, char* argv[])
{
    try
    {
        if(argc > 1)
            readAndHandleFile(argv[1]);

        return 0; //0 means successful execution
    }
    catch(std::exception const& e)
    {
        std::cout << "ERROR: " << e.what() << '\n';
    }
    catch(...)
    {
        std::cout << "UNKNOWN ERROR\n";
    }
    return -1; //signal error
}

HTH,
Frank
