Large files on Windows?


Nedo Papanti

Nov 17, 2023, 4:36:21 PM
to scite-interest
Hello,

Please note: I am not an expert.

On Windows 11 64-bit I have installed the x64 SciTE version for Windows created by Troy Simpson and available here https://www.ebswift.com/scite-text-editor-installer.html (that is https://www.ebswift.com/uploads/7/0/4/0/70403747/scite-5.3.7x64.msi).

When I try to open a file of 2,510,747,416 bytes (it is an amino acid sequence database in fasta format), SciTE warns me that the file is larger than the 2,000,000,000 bytes limit set in the properties and I am asked if I still want to open it; when I click on "Yes" I wait for a while and then the SciTE window fills up with "NUL" texts and the vertical scroll bar is at the bottom.
If I move the scroll bar up above a certain point I start seeing the correct text I am expecting, but if I scroll back down, at a certain point the "NUL" texts fill the SciTE window again.

I am aware that there are other options to deal with big files of this type but, despite this, is there anything that I can do to improve this with SciTE?

I have searched this Google group with "file.size.large" and I read also the sentence from Neil Hodgson saying "An undocumented file.size.large property was added to SciTE that allowed loading files larger than 2GB. However, SciTE uses 32-bit integers for file positions so many features wouldn’t work past the 2GB point" but I am not able to say with certainty if there is something that can be done for my case.

On my SciTEGlobal.properties file I currently have:
file.size.large=1000000
file.size.no.styles=1000000
#idle.styling=1 (uncommenting it didn't change what I reported above nor did it have an effect on the speed of scrolling up and down in such a big file like mine)

Thank you

Best Regards

Nedo

Neil Hodgson

Nov 18, 2023, 5:01:44 AM
to scite-interest
Hi Nedo,

> When I try to open a file of 2,510,747,416 bytes (it is an amino acid sequence database in fasta format)

OK, looked up fasta and it should have short (< 200 byte) lines. Very long lines can cause performance problems. Extremely long lines may cause failures.

If you own the file or have permission to make it available, it may help to publish it, preferably compressed.

> , SciTE warns me that the file is larger than the 2,000,000,000 bytes limit set in the properties and I am asked if I still want to open it; when I click on "Yes" I wait for a while and then the SciTE window fills up with "NUL" texts and

The most likely reason for NUL blobs appearing is that the file contains NUL bytes.

SciTE uses RAM to store the entire contents of files and also needs more memory to index lines, so may not work well on this file if your computer contains less than 4GB of RAM.
 
> the vertical scroll bar is at the bottom.

That is unusual: SciTE normally shows the start of newly opened files.
 
> If I move the scroll bar up above a certain point I start seeing the correct text I am expecting, but if I scroll back down, at a certain point the "NUL" texts fill the SciTE window again.

That may mean that there are blocks of NUL bytes at the file end.

> I am aware that there are other options to deal with big files of this type but, despite this, is there anything that I can do to improve this with SciTE?

First determine if there is a problem: count the number of NUL bytes in the file with a script. Something like this piece of Python with fileName set to the path to your file.

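import pathlib

fileName = "../bin/SciTE.exe"  # set this to the path of your file
data = pathlib.Path(fileName).read_bytes()  # reads the whole file into memory
nulCount = data.count(b"\0")  # counted outside the f-string: a backslash inside one needs Python 3.12+
print(f"{fileName} contains {nulCount} NULs in {len(data)} bytes")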
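For a multi-gigabyte file, the same count can be done without holding the whole file in memory. The following is only a sketch of that idea, not part of the original suggestion; the function name `count_nuls` and the 1 MiB chunk size are illustrative assumptions.

```python
import pathlib
import tempfile

def count_nuls(path, chunk_size=1 << 20):
    """Return (nul_count, total_bytes) for the file at path, reading it in chunks."""
    nul_count = 0
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            nul_count += chunk.count(b"\0")
            total += len(chunk)
    return nul_count, total

# Self-check on a small temporary file containing two NUL bytes
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "sample.bin"
    sample.write_bytes(b">seq1\nMKV\0\0LLA\n")
    print(count_nuls(sample))  # prints (2, 15)
```

If the count is zero for a file that shows NULs in the editor, the NULs are coming from the loading path rather than from the file itself.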
 
> I have searched this Google group with "file.size.large" and I read also the sentence from Neil Hodgson saying "An undocumented file.size.large property was added to SciTE that allowed loading files larger than 2GB. However, SciTE uses 32-bit integers for file positions so many features wouldn’t work past the 2GB point" but I am not able to say with certainty if there is something that can be done for my case.

The 64-bit version of SciTE has used 64-bit positions for over 3 years now and can load 2.5 GB files - I just loaded a 4GB file and it worked correctly.

Neil

Nedo Papanti

Nov 20, 2023, 4:09:57 AM
to scite-interest
Hi Neil,

Thank you for your prompt feedback.

- The two (randomly chosen for testing purposes) fasta files I used are public and are these two (two locations each so that people can choose the fastest location for them; they are regularly updated but the same versions I used for my original email will be still there for a couple of months or so):
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
https://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
https://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
SciTE (as well as other programs) can open/view/edit these two files without issues as far as my current experience is concerned, and I haven't stumbled on any NUL values in these two files over the many years I've been using them with SciTE and other programs.
Unfortunately I won't be able to count the NUL values any time soon (despite your helpful and clear details), so I prefer to give full details of what I did, hoping that someone with experience can give advice.
- In order to create a test file larger than 2,000,000,000 bytes (to be able to test SciTE with large files for the first time), I pasted together these two fasta files on the Windows command prompt with:
copy /b uniprot_sprot_varsplic.fasta+uniprot_sprot_varsplic.fasta UP.fasta
This UP.fasta file can be opened/viewed/edited by SciTE and other programs without issues and I didn't stumble in any NUL values in this file so I guess the "copy /b" command didn't do any harm in that sense (I've been doing this type of merging of this type of files for years and years and never had any sort of issue whatsoever with the upstream use I made of them).
Next I created eight copies of the same UP.fasta and I merged them together again with "copy /b" to have the 2,510,747,416 bytes test file I mentioned in my original email.
- The RAM of the machine I used is 32 GB
- Some of the fasta headers (a fasta header is what comes after the ">" character up to the first line break) can be long; quite a lot more than the 60-character line break in the sequence area (the sequence area is the text in between two ">" characters and excluding the fasta headers lines)
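
For anyone reproducing this on another platform, the byte-for-byte merging done with "copy /b" can be approximated in Python. This is a minimal sketch under stated assumptions: the helper name `concat_binary` and the tiny stand-in files are hypothetical, and real inputs would be the decompressed fasta files.

```python
import pathlib
import shutil
import tempfile

def concat_binary(sources, dest):
    """Write each source file to dest in order, byte for byte; return dest's size."""
    with open(dest, "wb") as out:
        for src in sources:
            with open(src, "rb") as inp:
                shutil.copyfileobj(inp, out)  # streams in blocks, so large inputs are fine
    return pathlib.Path(dest).stat().st_size

# Demonstration with tiny stand-in files (6 bytes each)
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    a = tmp / "a.fasta"
    b = tmp / "b.fasta"
    a.write_bytes(b">a\nMK\n")
    b.write_bytes(b">b\nLL\n")
    merged = tmp / "UP.fasta"
    concat_binary([a, b], merged)
    # Eight copies of the merged file, as in the steps described above
    size = concat_binary([merged] * 8, tmp / "big.fasta")
    print(size)  # 8 * (6 + 6) = 96
```

Because the copy is purely binary, it cannot introduce NUL bytes that are not already in the inputs.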

With this info it should be possible to try reproducing my issue (Windows and SciTE for Windows details in my original email).

Hope this additional info helps in figuring out if there's room for improvement in my usage of SciTE with this type of large file.

Thanks again

Best Regards

Emanuele

Nedo Papanti

Nov 20, 2023, 4:50:17 AM
to scite-interest
Hi Neil,
I just discovered that the issue is with the right click options in the SciTE for Windows I am using.
If I open SciTE and then I go to "File, Open" and I select my merged fasta file, I see a progress bar and then the file is there (scroll bar at the top, like you said), no NUL values and all is fine, meaning that also scrolling through the large fasta file is really a nice experience.
If I right click (like I always do) on my merged fasta file and I select either "Edit with SciTE in New Tab" or "Edit with SciTE in New Window" I don't get the progress bar and I get what I detailed below (NUL values) and also the scrolling experience is really slow.
Hope this helps.
Nedo

Neil Hodgson

Nov 20, 2023, 3:57:10 PM
to scite-interest
Hi Nedo,

> I just discovered that the issue is with the right click options in the SciTE for Windows I am using.
> If I open SciTE and then I go to "File, Open" and I select my merged fasta file, I see a progress bar and then the file is there (scroll bar at the top, like you said), no NUL values and all is fine, meaning that also scrolling through the large fasta file is really a nice experience.
> If I right click (like I always do) on my merged fasta file and I select either "Edit with SciTE in New Tab" or "Edit with SciTE in New Window" I don't get the progress bar and I get what I detailed below (NUL values) and also the scrolling experience is really slow.

Check that these right click menu items are connected to the same version of SciTE as when opening in other ways. Look at Help | About SciTE (or About Sc1 for the single file version) and look at the version number on the 2nd/3rd line and whether "32-bit" is the second line. It's possible these menu items are connected to an older 32-bit version of SciTE.

Neil


Nedo Papanti

Nov 21, 2023, 3:39:06 AM
to scite-interest
Hi Neil,

I have my own Italian language file for SciTE so when I open SciTE and I go on "Aiuto, Informazioni su SciTE" I get the following text then followed by the list of names of collaborators:
----
SciTE
Versione 5.3.7   Scintilla:5.3.6   Lexilla:5.2.6
    Jul 26 2023 15:44:00
di Neil Hodgson.
December 1998-July 2023.
http://www.scintilla.org
Lua scripting language by TeCGraf, PUC-Rio
    http://www.lua.org
Traduzione Italiana di SciTE
----
As I said in my previous posts, I have used Troy Simpson's x64 msi installer for SciTE version 5.3.7 and no other previous versions of SciTE were installed on that machine.

Thanks again

Best Regards

Neil Hodgson

Nov 21, 2023, 4:03:39 AM
to scite-interest
I have been able to reproduce the problem with right clicking to run SciTE.

To avoid wasting space on indexes like line start positions, Scintilla document objects are created as either normal (less than 2GB) or large and either with styles or without styles. To simplify implementation, documents can't switch between these modes once created.

The different behaviour experienced is because large files are interactively opened using multi-threading so SciTE remains responsive. However, when some other code requests a file be opened, that is performed synchronously so that subsequent commands (like go to line 876543 or highlight every "wittichii") will operate on the completely loaded file.

The different approaches also use different calls: the multi-threaded loader checks the size and creates a large file mode document if needed; the single-threaded loader commonly reuses an existing normal size document. Thus the right-click menu is using a document object that is not capable of holding more than 2GB and it gets a little confused.
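
The 2GB boundary for normal documents lines up with the largest value a signed 32-bit integer can hold. Assuming that is the limit in play (an assumption for this sketch, not something confirmed by SciTE's source here), a quick check shows why the 2,510,747,416 byte test file cannot fit in a normal document while a file at the 2,000,000,000 byte warning threshold still can. The function name `needs_large_document` is hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, assumed position limit of a normal document

def needs_large_document(file_size):
    """Return True when positions in the file would not fit in a signed 32-bit integer."""
    return file_size > INT32_MAX

print(needs_large_document(2_510_747_416))  # True: the merged test file from this thread
print(needs_large_document(2_000_000_000))  # False: still below the 32-bit limit
```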

One solution could be to force the multi-threaded loader for large files. The calling application could wait before asking SciTE to do more actions.

Another possibility is for the single-threaded loader to check for large files and create a new large document object. The code for this is a bit complicated though.

Neil

Nedo Papanti

Nov 21, 2023, 4:26:38 AM
to scite-interest
Hi Neil,

I think I grasp the high-level meaning of what you say.

From my side as a user I am happy to know that I can use right-click options as usual while I need to go for "File, Open" for large files.

SciTE is really great: thanks to all involved!

Thanks again

Best Regards

Neil Hodgson

Nov 27, 2023, 4:41:48 PM
to scite-interest
I have made changes that check for large files when asked to open by another application or specified on the command line.

If the set of desired document flags (TextLarge and StylesNone) differs from that of the current document then a new document is created with the desired flags. Since this is a new document, features like undo history will *not* carry over when a file is reopened in this way.
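
The flag comparison described above can be modelled roughly as follows. This is an illustrative sketch, not SciTE's actual code: the function names are hypothetical, the way the two size thresholds map to flags is an assumption based on the file.size.large and file.size.no.styles properties mentioned earlier in the thread, and the flag values are meant to mirror Scintilla's document options as far as I know them.

```python
import enum

class DocumentOption(enum.IntFlag):
    DEFAULT = 0
    STYLES_NONE = 0x1    # document stores no style bytes
    TEXT_LARGE = 0x100   # document may exceed 2GB

def desired_options(file_size, size_large, size_no_styles):
    """Pick the flags a document should have for a file of the given size."""
    options = DocumentOption.DEFAULT
    if file_size > size_large:
        options |= DocumentOption.TEXT_LARGE
    if file_size > size_no_styles:
        options |= DocumentOption.STYLES_NONE
    return options

def choose_document(file_size, current_options, size_large, size_no_styles):
    """Return (options, is_new); a new document means undo history is not carried over."""
    wanted = desired_options(file_size, size_large, size_no_styles)
    if wanted != current_options:
        return wanted, True
    return current_options, False

options, is_new = choose_document(
    2_510_747_416, DocumentOption.DEFAULT,
    size_large=2_000_000_000, size_no_styles=1_000_000,
)
print(is_new, DocumentOption.TEXT_LARGE in options, DocumentOption.STYLES_NONE in options)
```

In this model a small file opened into a default document keeps the existing document, so undo history is only lost when the flags actually have to change.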

The committed changes can be examined either in the repository:

hg clone http://hg.code.sf.net/p/scintilla/scite

or from

https://www.scintilla.org/wscite.zip Windows executable (64-bit)

Neil