Streaming fs.readdir?

Aseem Kishore

Oct 7, 2014, 9:49:52 PM
to nod...@googlegroups.com
Hi there,

I have a directory with a very large number of files in it (over 1M). I need to process them, so I'm using fs.readdir() on the directory.

The problem is, fs.readdir() returns everything at once, causing my script to suddenly consume >1 GB of RAM.
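
For reference, this is essentially what I'm doing (the path here is hypothetical):

var fs = require('fs');

fs.readdir('/data/huge-dir', function (err, names) {
  if (err) throw err;
  // `names` is one giant in-memory array of every entry name;
  // with 1M+ files the whole listing sits in RAM at once
  console.log(names.length);
});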

AFAICT, there's no way to stream this list instead of returning it all at once. Is there anything equivalent that I can do?

Thanks!

Aseem

Forrest Norvell

Oct 7, 2014, 10:50:27 PM
to nod...@googlegroups.com

I know there’s been some talk about adding streaming readdir() to libuv recently, but that’s of little use to you now. There are also some modules (like Thorsten Lorenz’s readdirp) that have streaming interfaces, but that only works if you want a recursive readdir(); it’s no help if you’ve got one enormous directory (like you do). Unless you want to write your own native module that follows the approach described in that libuv thread, I think you’re out of luck. Sorry! :/

F

HackerOUT

Oct 8, 2014, 2:08:12 AM
to nod...@googlegroups.com
Hi there,

Not the perfect solution, but it works for your case. Depending on your system, do something like this:

var spawn = require('child_process').spawn;

// on Windows; on *nix use spawn('ls', ['directoryname']) instead
var ls = spawn('cmd', ['/c', 'dir', '/B', 'directoryname']);

ls.stdout.on('data', function (data) {
  ls.stdout.pause(); // you can skip the pause/resume entirely and use pipe instead
  // split the chunk into lines and process each file, then call
  // ls.stdout.resume() when done; the setTimeout here just simulates
  // your async processing of a file
  setTimeout(function () { ls.stdout.resume(); }, 500);
});

You also need to listen for stderr data and for the close event (ls.on('close', ...)).

Cheers

Lorenzo Giuliani

Oct 8, 2014, 5:25:26 AM
to nod...@googlegroups.com
As a quick hack, you could stream the output of `ls`.

Floby

Oct 8, 2014, 6:13:12 AM
to nod...@googlegroups.com
You could hack something together by spawning an `ls` command and reading its stdout output.


var spawn = require('child_process').spawn;
var split = require('split'); // the `split` module from npm, re-chunks a stream by line

spawn('ls', [my_dir_path]).stdout.pipe(split()).on('data', function (name) {
  // name is the file or dir listed in your main directory
});

Matt

Oct 8, 2014, 10:29:23 AM
to nod...@googlegroups.com
You could stream from a child process written in another language. For example you could do 'opendir(DIR, $ARGV[0]); print "$_\n" while $_ = readdir(DIR)' as a perl script (in Perl, $ARGV[0] is the first command-line argument, here the directory).
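
Wired up from Node, that might look something like this (a sketch; the directory path is hypothetical):

var spawn = require('child_process').spawn;

var perl = spawn('perl', [
  '-e', 'opendir(DIR, $ARGV[0]); print "$_\\n" while $_ = readdir(DIR);',
  '/data/huge-dir' // hypothetical directory
]);

perl.stdout.setEncoding('utf8');
var leftover = '';
perl.stdout.on('data', function (chunk) {
  var lines = (leftover + chunk).split('\n');
  leftover = lines.pop(); // partial trailing line, completed by the next chunk
  lines.forEach(function (name) {
    if (name) console.log('entry:', name);
  });
});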

Matt

Oct 8, 2014, 11:39:34 AM
to nod...@googlegroups.com
Please note that the suggestions to shell out to "ls" are dangerous, as "ls" has a bunch of default options (some set by the environment) and will by default try to sort the output, reading the whole directory into memory first.

At a minimum you want the "-1 -f" flags set.
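
For example (a sketch; the directory path is hypothetical):

var spawn = require('child_process').spawn;

// -1: one name per line; -f: don't sort, so ls can start printing without
// reading the whole directory first. Note -f also implies -a, so "." and
// ".." will show up in the output. '--' guards against the directory name
// being mistaken for an option.
var ls = spawn('ls', ['-1', '-f', '--', '/data/huge-dir']);
ls.stdout.pipe(process.stdout);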

Aria Stewart

Oct 8, 2014, 11:58:43 AM
to nod...@googlegroups.com

On Oct 8, 2014, at 11:34 AM, Matt <hel...@gmail.com> wrote:

> Please note that the suggestions to shell out to "ls" are dangerous as "ls" has a bunch of default options (some set by the environment) and will by default try and sort (reading the whole directory into memory).
>
> At the minimum you want the "-1 -f" flags set.


Or `find . -maxdepth 1 -mindepth 1`

And you can null-separate the output that way (with `-print0`).
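
A sketch of that (hypothetical path), which also survives file names containing newlines:

var spawn = require('child_process').spawn;

var find = spawn('find', ['/data/huge-dir',
  '-maxdepth', '1', '-mindepth', '1', '-print0']);

find.stdout.setEncoding('utf8');
var leftover = '';
find.stdout.on('data', function (chunk) {
  var parts = (leftover + chunk).split('\0'); // entries are NUL-separated
  leftover = parts.pop(); // partial name, finished by the next chunk
  parts.forEach(function (name) {
    console.log('entry:', name);
  });
});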

Aseem Kishore

Oct 8, 2014, 5:52:54 PM
to nod...@googlegroups.com
Thanks Forrest for the confirmation! I'll subscribe.

And thanks all for the streaming `ls [-f]` suggestion. I should have mentioned I had tried that, but it doesn't really solve this problem.

The reason is that unless you're processing each file instantly, the backpressure is going to build up: `ls` keeps outputting names while your app isn't processing them, so memory usage is going to build up inside Node either way. Pausing the stream only pauses the output stream, causing Node to buffer up the data; it can't magically pause `ls` execution.

At least, that's my understanding. Happy to be corrected.

Aseem

Matt

Oct 9, 2014, 10:31:01 AM
to nod...@googlegroups.com
The buffer size of the pipe between your process and ls should fill up eventually, and ls's calls to write to stdout should then block, not allowing ls to continue until you've consumed the buffer.

At least that's my theory - maybe you aren't applying backpressure right?
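
For illustration, a sketch of applying that backpressure with pause/resume (processFile is a hypothetical async worker of your own; a real version would also buffer partial lines across chunks):

var spawn = require('child_process').spawn;

var ls = spawn('ls', ['-1', '-f', '/data/huge-dir']); // hypothetical path
ls.stdout.setEncoding('utf8');

ls.stdout.on('data', function (chunk) {
  ls.stdout.pause(); // stop reading until this batch is processed
  var names = chunk.split('\n').filter(Boolean);
  var pending = names.length;
  if (!pending) return ls.stdout.resume();
  names.forEach(function (name) {
    processFile(name, function () { // processFile(name, cb): your own async worker
      // resume only when the whole batch is done; meanwhile the kernel
      // pipe buffer fills up and ls blocks on its write() calls
      if (--pending === 0) ls.stdout.resume();
    });
  });
});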

Bruno Jouhier

Oct 9, 2014, 3:27:05 PM
to nod...@googlegroups.com
Hi Aseem,

You could use ez-streams' ez.devices.file.list(path) function. It will give you a stream of file names over a directory tree. Unfortunately, it won't really stream over the contents of a single dir, because there is no native API to support this (in that case it just does virtual streaming over an in-memory array of names).

If you are generating the files, you could arrange to dispatch them into subdirectories. But if they are generated by something you don't control, you are probably toast.

You may also run into perf problems with the file system itself. How well does it cope with 1M+ entries per directory?

Bruno

Matt

Oct 9, 2014, 4:56:47 PM
to nod...@googlegroups.com

On Thu, Oct 9, 2014 at 3:27 PM, Bruno Jouhier <bjou...@gmail.com> wrote:
You may also run into perf problem with the file system itself. How well does it cope with 1M+ entries per directory?

Answering that depends on the filesystem used. Some struggle. Some are OK.

julien...@joyent.com

Oct 9, 2014, 7:01:36 PM
to nod...@googlegroups.com
Hi!

For those interested, you might want to follow what's going on in https://github.com/joyent/libuv/pull/1521 and https://github.com/joyent/libuv/issues/1430.
If we can get this in libuv, then we'll be able to work on adding a streaming API for reading directories in Node.js.

Julien

Bruno Jouhier

Oct 10, 2014, 2:54:27 AM
to nod...@googlegroups.com
Here is a simple solution with ez-streams and child_process:

var ez = require('ez-streams');
var cp = require('child_process');

var reader = ez.devices.child_process.reader(cp.spawn('ls'), {
  encoding: "utf8"
}).transform(ez.transforms.lines.parser());

// just check that it works
reader.pipe(_, ez.devices.console.log);

It will handle the backpressure on the child process standard output.

Bruno

Floby

Oct 10, 2014, 4:42:49 AM
to nod...@googlegroups.com
Actually, if your implementation of what you want to do with each file is a correctly implemented writable stream, then backpressure will be handled for you.
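
For instance (a sketch using the streams2 Writable class and the `split` module; processFile is a hypothetical async function of your own):

var spawn = require('child_process').spawn;
var split = require('split');
var Writable = require('stream').Writable;

var processor = new Writable({ objectMode: true });
processor._write = function (name, enc, done) {
  if (!name) return done(); // skip the empty trailing line
  processFile(name, done); // pipe won't deliver the next name
                           // until done() is called
};

spawn('ls', ['-1', '-f', '/data/huge-dir']) // hypothetical path
  .stdout
  .pipe(split())
  .pipe(processor);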

Matt

Oct 10, 2014, 11:24:32 AM
to nod...@googlegroups.com

On Fri, Oct 10, 2014 at 2:54 AM, Bruno Jouhier <bjou...@gmail.com> wrote:
var ez = require('ez-streams');
var cp = require('child_process');

var reader = ez.devices.child_process.reader(cp.spawn('ls'), {
  encoding: "utf8"
}).transform(ez.transforms.lines.parser());

// just check that it works
reader.pipe(_, ez.devices.console.log);

It will handle the backpressure on the child process standard output.

Don't forget the -f flag :)

Frank Lemanschik

Dec 22, 2018, 11:55:30 PM
to nodejs

I have created a nice wrapper around the native operating system find command. It works on Mac, Linux, and Windows, and returns a real observable stream this way until libuv is ready:

- https://www.npmjs.com/package/fs-readdir-stream