
How to read an identified part of a huge text file?


Janis Papanagnou

Apr 2, 2023, 9:52:00 AM
I want to read identified content from a huge text file that resides in
the file system. (My JavaScript code is embedded in an HTML page. I am
running all code client-side and have no application servers or
database systems running.)

I've found a suggestion using 'require("fs")', but the samples load the
whole file content, so it doesn't seem to fit my data file, which is
megabytes large and which I strictly want to avoid loading as a whole.

My data file is actually structured as <key> <TAB> <text-data> lines
and I just want to extract the <text-data> given the respective <key>.
Is there some simple standard way to achieve that extraction?

The second question is whether it is possible to find the <key>s given
a text-match (a string match or ideally a regular expression match) on
the respective <text-data> on the external file?

For a solution/workaround to both questions it might also be useful to
call an external extractor (awk, perl, ...) from JavaScript and read in
the output of such an external tool invocation. Is that possible?

Thanks for any hints.

Janis

Jon Ribbens

Apr 2, 2023, 10:49:36 AM
On 2023-04-02, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
> I want to read identified content from a huge text file that resides in
> the file system. (My JavaScript code is embedded in an HTML page. I am
> running all code client-side and have no application servers or
> database systems running.)
>
> I've found a suggestion using 'require("fs")', but the samples load the
> whole file content, so it doesn't seem to fit my data file, which is
> megabytes large and which I strictly want to avoid loading as a whole.

require('fs') is a Node.js thing, which is not going to work if you're
using in-browser JavaScript.

> My data file is actually structured as <key> <TAB> <text-data> lines
> and I just want to extract the <text-data> given the respective <key>.
> Is there some simple standard way to achieve that extraction?

I think in a modern browser you might be able to use the fetch and
streams APIs to read the file a chunk at a time. e.g.

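// Run inside an async function or a module script (top-level await)
const response = await fetch('myfile.txt')
for await (const chunk of response.body) {
  // Each chunk is a Uint8Array of raw bytes, not a line of text
  // Do something with each chunk
}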
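
A minimal sketch of how the chunk-at-a-time reading could be turned into the lookup by <key>. It assumes UTF-8 text, '\n' line endings, and a browser in which response.body can be iterated with for await; the helper names linesOf and findByKey are made up for this illustration and are not part of any API.

```javascript
// Yield complete lines from an async iterable of Uint8Array chunks
// (for example response.body where streams are async-iterable).
async function* linesOf(chunks) {
  const decoder = new TextDecoder();
  let pending = '';
  for await (const chunk of chunks) {
    // stream: true keeps multi-byte characters split across chunks intact
    pending += decoder.decode(chunk, { stream: true });
    const parts = pending.split('\n');
    pending = parts.pop(); // the last piece may be an incomplete line
    yield* parts;
  }
  pending += decoder.decode(); // flush any bytes still buffered
  if (pending !== '') yield pending;
}

// Hypothetical helper: return the <text-data> stored under searchKey,
// or undefined if no line starts with that key.
async function findByKey(chunks, searchKey) {
  for await (const line of linesOf(chunks)) {
    const tab = line.indexOf('\t');
    if (tab !== -1 && line.slice(0, tab) === searchKey) {
      return line.slice(tab + 1); // stops reading as soon as the key is found
    }
  }
  return undefined;
}
```

Usage would look something like findByKey(response.body, 'some-key'). Only one chunk plus one partial line is held in memory at a time, but the file is still scanned from the start up to the matching line.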

> The second question is whether it is possible to find the <key>s given
> a text-match (a string match or ideally a regular expression match) on
> the respective <text-data> on the external file?

Yes? I'm not sure I understand that question.

> For a solution/workaround to both questions it might also be useful to
> call an external extractor (awk, perl, ...) from JavaScript and read in
> the output of such an external tool invocation. Is that possible?

Not from inside a browser, no.

Janis Papanagnou

Apr 2, 2023, 2:05:39 PM
Thanks for your hints and insights thus far!

On 02.04.2023 16:49, Jon Ribbens wrote:
> On 2023-04-02, Janis Papanagnou <janis_pap...@hotmail.com> wrote:
>> My data file is actually structured as <key> <TAB> <text-data> lines

>> The second question is whether it is possible to find the <key>s given
>> a text-match (a string match or ideally a regular expression match) on
>> the respective <text-data> on the external file?
>
> Yes? I'm not sure I understand that question.

My first question could be described (informally) by something like

Select <text-data> From <text-file> Where <key> Equals <search-key>

whereas the second one matches on the data and returns the keys that
identify the matching records, like

Select <keys> From <text-file> Where <text-data> Matches <s1> And <s2>

with the possibility either to get the keys of all s1/s2-matching
records in one returned set, or to get these keys sequentially, or to
operate on the matching records (which are identified by their keys).

Basically, in both questions I want (line-wise, record-wise) access
to the data: either the <text-data> selected by <key>, or the <keys>
where the <text-data> matches a search criterion.

The point is: once the data is read into memory accessible to JS I can
do everything (including matching), but the bottleneck is the mass of
data in the file, so I need to preselect the desired records (so as
not to have to load the file completely into memory).

(I hope this clears things up rather than muddying them further.)

The suggestion of using await fetch('myfile.txt') sounds like it's
a raw (byte-oriented) data function (not a line/record-oriented one),
but I will be looking into that as well. Thanks again.

Janis

Jon Ribbens

Apr 2, 2023, 3:58:34 PM
Yes, although there's an example of how to use it to read line-by-line
here:

https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamDefaultReader/read#example_2_-_handling_text_line_by_line
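
Once lines are available one at a time, the "Select <keys> ... Where <text-data> Matches <s1> And <s2>" query could be sketched roughly like this. The function name keysMatching is hypothetical, and the input is assumed to be any sync or async iterable of "<key> TAB <text-data>" strings (such as the output of a line reader like the one in the linked example).

```javascript
// Hypothetical helper: collect the keys of all records whose <text-data>
// matches every pattern in `patterns` (an array of RegExp objects).
// Patterns should not use the g or y flag, because RegExp.prototype.test
// is stateful (it advances lastIndex) for such patterns.
async function keysMatching(lines, patterns) {
  const keys = [];
  for await (const line of lines) {
    const tab = line.indexOf('\t');
    if (tab === -1) continue; // skip lines without a <TAB> separator
    const text = line.slice(tab + 1);
    if (patterns.every((re) => re.test(text))) {
      keys.push(line.slice(0, tab));
    }
  }
  return keys;
}
```

To process matches sequentially instead of collecting them in one set, the same loop could be written as an async generator that yields each key as it is found.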

I think the only solution available to you in a browser is to use
IndexedDB. On the plus side though, it's quite a good solution.
Basically, write a function in JavaScript to read and parse the
file and load it into an in-browser database:

https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API

and then you can search this indexed database of objects, which
should be very fast and efficient. You just need to make sure that
your code checks for the existence of the database and re-creates
it from the file if it doesn't exist due to the browser having
decided to expire it.
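
A rough sketch of that approach, under some assumptions: the database name 'records-db', the store name 'records', and all function names are invented for illustration, and for brevity the initial load fetches the file in one go (it could be combined with chunked reading instead).

```javascript
// Parse "<key>\t<text-data>" lines into { key, text } objects.
function parseRecords(text) {
  const records = [];
  for (const line of text.split('\n')) {
    const tab = line.indexOf('\t');
    if (tab !== -1) {
      records.push({ key: line.slice(0, tab), text: line.slice(tab + 1) });
    }
  }
  return records;
}

// Turn an IDBRequest into a Promise.
const wrap = (request) => new Promise((resolve, reject) => {
  request.onsuccess = () => resolve(request.result);
  request.onerror = () => reject(request.error);
});

// Open the database, creating the object store on first use.
function openDb() {
  const request = indexedDB.open('records-db', 1);
  request.onupgradeneeded = () => {
    request.result.createObjectStore('records', { keyPath: 'key' });
  };
  return wrap(request);
}

// Store all records in a single read-write transaction.
function putAll(db, records) {
  return new Promise((resolve, reject) => {
    const tx = db.transaction('records', 'readwrite');
    const store = tx.objectStore('records');
    for (const record of records) store.put(record);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

// Re-create the contents from the file if the store is empty
// (for example because the browser expired the database).
async function ensureLoaded(db, url) {
  const count = await wrap(db.transaction('records').objectStore('records').count());
  if (count === 0) {
    const text = await (await fetch(url)).text();
    await putAll(db, parseRecords(text));
  }
}

// Look up the <text-data> for one key.
async function getText(db, key) {
  const record = await wrap(db.transaction('records').objectStore('records').get(key));
  return record ? record.text : undefined;
}
```

The key lookup uses the store's primary key directly. Matching on the <text-data> itself would still mean walking the records (for instance with a cursor), since a regular expression cannot be answered from an IndexedDB index.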

JJ

Apr 2, 2023, 7:09:00 PM
On Sun, 2 Apr 2023 20:05:31 +0200, Janis Papanagnou wrote:
>
> The suggestion of using await fetch('myfile.txt') sounds like it's
> a raw (byte-oriented) data function (not a line/record-oriented one),
> but I will be looking into that as well. Thanks again.
>
> Janis

With Fetch/XHR and the `Range` HTTP request header, you'll need to have a
pre-generated index file for the text file lines, if you want to get only
specific lines without having to read the whole file. The index file would
contain the byte offsets of each line in the text file, so that you know the
byte range in which a specific line is located.
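
As an illustration only, such an index and the matching request might look like the sketch below. The names buildIndex and fetchRecord are hypothetical; the index is assumed to be generated ahead of time from the file's raw bytes (UTF-8, '\n' line endings), and the file is assumed to be served over HTTP by a server that honours `Range` requests.

```javascript
// Build a Map from key to [firstByte, lastByte] (inclusive) of its
// <text-data>. Records with empty <text-data> are left out, because an
// empty byte range cannot be expressed in a Range header.
function buildIndex(bytes) {
  const decoder = new TextDecoder();
  const index = new Map();
  let start = 0;
  while (start < bytes.length) {
    let end = bytes.indexOf(0x0a, start); // position of '\n'
    if (end === -1) end = bytes.length;
    const tab = bytes.indexOf(0x09, start); // position of '\t'
    if (tab !== -1 && tab + 1 < end) {
      const key = decoder.decode(bytes.subarray(start, tab));
      index.set(key, [tab + 1, end - 1]);
    }
    start = end + 1;
  }
  return index;
}

// Request only the bytes of one record. Status 206 (Partial Content)
// means the server honoured the Range header.
async function fetchRecord(url, [first, last]) {
  const response = await fetch(url, {
    headers: { Range: `bytes=${first}-${last}` },
  });
  if (response.status !== 206) {
    throw new Error('Range request not honoured, status ' + response.status);
  }
  return response.text();
}
```

The index itself still has to be loaded by the page, but it is much smaller than the data file when the <text-data> parts are long.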

V

Apr 3, 2023, 9:55:17 AM

Michael Haufe (TNO)

Apr 4, 2023, 8:59:53 PM
In the latest browsers there is a feature called the Origin private file system (OPFS):

<https://developer.mozilla.org/en-US/docs/Web/API/File_System_Access_API#origin_private_file_system>

This provides a FileSystemSyncAccessHandle:

<https://developer.mozilla.org/en-US/docs/Web/API/FileSystemSyncAccessHandle>

which has a `read()` method:

<https://developer.mozilla.org/en-US/docs/Web/API/FileSystemSyncAccessHandle/read>

That method, given an appropriately sized buffer (the size being your record size) and the `at` option for the byte offset, will let you read from a specific location in the file.
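
A small sketch of such a positioned read. It rests on several assumptions: the code runs in a dedicated Web Worker (where the synchronous handle is available), the data file has already been copied into the origin private file system, and the byte offset and length of the record are known. The names openHandle and readRecord are invented for this example.

```javascript
// Obtain a sync access handle for a file in the origin private file
// system. Intended to be called from a dedicated worker.
async function openHandle(name) {
  const root = await navigator.storage.getDirectory();
  const fileHandle = await root.getFileHandle(name);
  return fileHandle.createSyncAccessHandle();
}

// Read `length` bytes starting at byte `offset` and decode them as UTF-8.
// `handle` is expected to behave like a FileSystemSyncAccessHandle.
function readRecord(handle, offset, length) {
  const buffer = new Uint8Array(length);
  const bytesRead = handle.read(buffer, { at: offset });
  // Fewer bytes than requested are returned near the end of the file.
  return new TextDecoder().decode(buffer.subarray(0, bytesRead));
}
```

Since the lines in a <key> TAB <text-data> file are generally not of fixed length, the offsets would have to come from an index of some kind.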