[Ann] New Mozilla Junk Mail Filter Tool Released!

David Spade

2003/06/24 1:40:20

On 23/06/2003 21:38, David Spade wrote:
> Hi All,
>
> This is the first release of what I hope will be a useful Junk Mail
> tool. I'll let the README speak about the features:
>
> ------------------------------------------------------------------
>
> [snip]
>
> ------------------------------------------------------------------
>
> Feast away, ladies and gents. I want ideas and bugs that are found so
> that I can make this thing better, so if you find something, tell me.
>
> Oh, and if anyone has a website that I can drop this on, I'd be much
> obliged. :)

It's come to my attention that some people may not wish to download J2SE
1.4.1, or consider it large and unwieldy. I have created a version of
this zipfile which also contains the classes, so that you will then only
need the Java 1.4.1 Runtime Environment to run the tool. If you're
interested in this, either email me about it (my email address is in the
readme that's in the original zip) or give me somewhere that I can put
it online - then, I can put a URL. The source zip is 34k, and the source
+ classes zip is 99k.

-=Straxus=-

us...@domain.invalid

2003/06/24 6:34:20

David Spade wrote:
> Hi All,
>
> This is the first release of what I hope will be a useful Junk Mail
> tool. I'll let the README speak about the features:

Hi, I liked it (considered writing one myself once :-)

I would love to have a column with a good vs bad indicator (divide bad by good)

So if the good = 40 and the bad = 140, the good vs bad = 3.5
So if the good = 100 and the bad = 20, the good vs bad = 0.2
So if the good = 20 and the bad = 300, the good vs bad = 15
So if the good = 700 and the bad = 10, the good vs bad = 0.01

So the bigger the number, the "badder" the token.
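The column suggested above could be computed roughly like this. This is only a sketch written against a current JDK rather than the 1.4.1 discussed in this thread; `TokenRatio` and `badToGood` are made-up names, and the handling of a zero "good" count is an assumption, since the examples above do not cover that case.

```java
final class TokenRatio {
    /**
     * Returns bad / good for one token.
     * Assumption: a token with no good hits is reported as infinitely "bad",
     * and a token with no hits at all is reported as 0.
     */
    static double badToGood(int good, int bad) {
        if (good == 0) {
            return bad == 0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return (double) bad / good;
    }

    public static void main(String[] args) {
        // The four good/bad pairs from the examples above.
        int[][] samples = { {40, 140}, {100, 20}, {20, 300}, {700, 10} };
        for (int[] s : samples) {
            System.out.printf("good=%d bad=%d ratio=%.2f%n",
                    s[0], s[1], badToGood(s[0], s[1]));
        }
    }
}
```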

have fun,
Jan

Jason Airlie

2003/06/24 8:44:14

David Spade <mi...@home.com> wrote in message news:<bd8nnp$fc...@ripley.netscape.com>...

Wouldn't mozdev.org be an appropriate place to host this?

David Spade

2003/06/24 14:19:17

Interesting idea, and one that I'll keep in mind for the next version of
the tool.

Thanks!

-=Straxus=-

David Spade

2003/06/24 14:09:05

> Wouldn't mozdev.org be an appropriate place to host this?

Hmm, interesting. S'a good idea. Why didn't I think of that one?

Looking into creating a mozdev project now, and will post back later
with results.

-=Straxus=-

Jeffrey Siegal

2003/06/24 14:57:48

Thomas Dodd wrote:
>
>
> David Spade wrote:
>
>> Hi All,
>>
>> This is the first release of what I hope will be a useful Junk Mail
>> tool. I'll let the README speak about the features:
>
>

> The GUI comes up fine, but I cannot write a text or html file.
> Both fail the same way (replacing txt/text with html).
>
> ST Dog >>~/j2sdk1.4.2/bin/java -cp . mozilla_training_analyzer.Analyzer
> -i ~/.mozilla/default/2ji7t0kd.slt/training.dat -f text -o ~/MozJunk.txt
> The number of good messages processed is 8930
> The number of bad messages processed is 2530
> Now processing 245682 good
> tokens...
>
> Now processing 95394 bad
> tokens...
>
> Merging token lists...
> Writing tokens to /home/ted/MozJunk.txt
> Exception in thread "main" java.lang.OutOfMemoryError
>
>
>
> Any ideas? I'm not real big on Java :(

Try adding -Xmx256m (or possibly a bigger number) to the java command
line. The default maximum heap size is 64 MB.

Thomas Dodd

2003/06/24 14:53:59

To: mozilla-...@mozilla.org, mi...@home.com.netscape.com

David Spade wrote:
> Hi All,
>
> This is the first release of what I hope will be a useful Junk Mail
> tool. I'll let the README speak about the features:

The GUI comes up fine, but I cannot write a text or html file.
Both fail the same way (replacing txt/text with html).

ST Dog >>~/j2sdk1.4.2/bin/java -cp . mozilla_training_analyzer.Analyzer
-i ~/.mozilla/default/2ji7t0kd.slt/training.dat -f text -o ~/MozJunk.txt
The number of good messages processed is 8930
The number of bad messages processed is 2530
Now processing 245682 good
tokens...

Now processing 95394 bad
tokens...
Merging token lists...
Writing tokens to /home/ted/MozJunk.txt
Exception in thread "main" java.lang.OutOfMemoryError

Any ideas? I'm not real big on Java :(

-Thomas

Karsten Düsterloh

2003/06/24 15:52:22

David Spade, however, began to speak and wrote:

>> I would love to have a column with a good vs bad indicator (divide bad by good)

[...]

> Interesting idea, and one that I'll keep in mind for the next version of
> the tool.

The funny thing is: the first version of my junk statistics in Mnenhy
*did* show the good/bad ratio - and I got a lot of requests for dropping
that in favor of the junk probability of the word... ;-)

Karsten
--
Freedom dies  | Fsayanne's SF&F Library:
with security | http://fsayanne.tprac.de/

David Spade

2003/06/24 16:42:39

On 24/06/2003 15:52, Karsten Düsterloh wrote:
> David Spade, however, began to speak and wrote:
>
>>>I would love to have a column with a good vs bad indicator (divide bad by good)
>
> [...]
>
>>Interesting idea, and one that I'll keep in mind for the next version of
>>the tool.
>
> The funny thing is: the first version of my junk statistics in Mnenhy
> *did* show the good/bad ratio - and I got a lot of requests for dropping
> that in favor of the junk probability of the word... ;-)

Heh, interesting. When the idea was first mentioned, I thought that it
could show really skewed numbers if one number was low and the other was
high, for instance 1 bad to 35 good. I figured that this feature would
make it in at the same time I added selectable columns to the table
interface. :)

-=Straxus=-

Ed Mullen

2003/06/24 17:28:37

Files are now online at:

http://edmullen.net/moz.html
http://edmullen.net/MozTrainingFilterTool.zip
http://edmullen.net/MozTrainingFilterToolNowWith100PCMoreClass.zip

--
Ed Mullen
http://edmullen.net
http://edmullen.net/moz.html
If a cow laughed, would milk come out her nose?

David Spade

2003/06/24 18:51:21

On 24/06/2003 17:28, Ed Mullen wrote:
> David Spade wrote:
>
>>On 23/06/2003 21:38, David Spade wrote:
>>
>>
>>>Hi All,
>>>
>>>This is the first release of what I hope will be a useful Junk Mail
>>>tool. I'll let the README speak about the features:
>>>

>>>[snip]

>>>
>>>Feast away, ladies and gents. I want ideas and bugs that are found so
>>>that I can make this thing better, so if you find something, tell me.
>>>
>>>Oh, and if anyone has a website that I can drop this on, I'd be much
>>>obliged. :)
>>
>>It's come to my attention that some people may not wish to download J2SE
>>1.4.1, or consider it large and unwieldy. I have created a version of
>>this zipfile which also contains the classes, so that you will then only
>>need the Java 1.4.1 Runtime Environment to run the tool. If you're
>>interested in this, either email me about it (my email address is in the
>>readme that's in the original zip) or give me somewhere that I can put
>>it online - then, I can put a URL. The source zip is 34k, and the source
>>+ classes zip is 99k.
>

> Files are now online at:
>
> http://edmullen.net/moz.html
> http://edmullen.net/MozTrainingFilterTool.zip
> http://edmullen.net/MozTrainingFilterToolNowWith100PCMoreClass.zip

And thanks again for hosting these, Ed!

The original source version is the one found at:

http://edmullen.net/MozTrainingFilterTool.zip

For those of you who do not want to install the 1.4.1 JDK and just have
the 1.4.1 JRE, please download:

http://edmullen.net/MozTrainingFilterToolNowWith100PCMoreClass.zip

That contains all of the class files. Then, in the README you can just
skip the compilation instructions and go straight to the running
instructions.

-=Straxus=-

David Spade

2003/06/24 19:01:51

On 24/06/2003 14:53, Thomas Dodd wrote:
>
> David Spade wrote:
>
>>Hi All,
>>
>>This is the first release of what I hope will be a useful Junk Mail
>>tool. I'll let the README speak about the features:
>
> The GUI comes up fine, but I cannot write a text or html file.
> Both fail the same way (replacing txt/text with html).
>
> ST Dog >>~/j2sdk1.4.2/bin/java -cp . mozilla_training_analyzer.Analyzer
> -i ~/.mozilla/default/2ji7t0kd.slt/training.dat -f text -o ~/MozJunk.txt
> The number of good messages processed is 8930
> The number of bad messages processed is 2530
> Now processing 245682 good
> tokens...

> Now processing 95394 bad
> tokens...

> Merging token lists...
> Writing tokens to /home/ted/MozJunk.txt
> Exception in thread "main" java.lang.OutOfMemoryError
>
> Any ideas? I'm not real big on Java :(

As Jeffrey mentioned, beneficial command-line options for java are:

-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size

So, your execution command would become something like:

~/j2sdk1.4.2/bin/java -Xms64m -Xmx256m -cp . \
    mozilla_training_analyzer.Analyzer \
    -i ~/.mozilla/default/2ji7t0kd.slt/training.dat -f text -o ~/MozJunk.txt

Make sure you put the JVM arguments before
mozilla_training_analyzer.Analyzer -- arguments after that will be
passed to the program, and that won't help your problem.
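To see the difference between JVM arguments and program arguments for yourself, a throwaway class like the one below can help. It is only a sketch and not part of the tool; `HeapCheck` is a made-up name. It prints the heap ceiling the JVM reports and whatever arguments reached `main`.

```java
import java.util.Arrays;

final class HeapCheck {
    /** Maximum heap the JVM will try to use, in megabytes, as reported by the runtime. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // JVM switches such as -Xmx change this value only when they
        // appear before the class name on the command line.
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
        // Anything after the class name shows up here instead.
        System.out.println("Program arguments: " + Arrays.toString(args));
    }
}
```

Running `java -Xmx256m -cp . HeapCheck` should report a ceiling close to 256 MB (the exact figure varies by JVM), while `java -cp . HeapCheck -Xmx256m` leaves the ceiling at its default and simply lists `-Xmx256m` among the program arguments.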

Out of curiosity, what size is your training.dat? I think mine is about
an order of magnitude smaller, so I didn't think to put in an
explanation of JVM switches for extra memory as I didn't have problems
during testing.

-=Straxus=-

David Spade

2003/06/24 19:45:17

On 24/06/2003 19:32, Matthias Versen wrote:
> David Spade wrote:
>
>
>>Hi All,
>>
>>This is the first release of what I hope will be a useful Junk Mail
>>tool. I'll let the README speak about the features:
>

> Is there a "compiled" Java binary somewhere? I don't want to
> install the SDK.

Sure is!

[copied from a previous post]

Matthias Versen

2003/06/24 19:32:10

David Spade wrote:

> Hi All,
>
> This is the first release of what I hope will be a useful Junk Mail
> tool. I'll let the README speak about the features:

Is there a "compiled" Java binary somewhere? I don't want to
install the SDK.

Matthias
--
Please delete everything between "matti" and the "@" in my mail address.

Daniel Greenspan

2003/06/25 8:44:11

>> David Spade wrote:
>> Is there somewhere a "compiled" java binary because I don't want to
>> install the sdk.

> Matthias Versen wrote:
> That contains all of the class files. Then, in the README you can just
> skip the compilation instructions and go straight to the running
> instructions.

Micro$oft wrote:

D:\Install>java -cp . mozilla_training_analyzer.Analyze
'java' is not recognized as an internal or external command, operable
program or batch file.

Daniel Greenspan writes:
Just illustrating David's point. While I seem to have some form of
Java available under and Mozilla, and Microsoft's scripting engine
will run .js files (although I wish it wouldn't 'cos they always
seem to be either viruses or slow buggy stuff), I cannot work out
whether I need to install Sun Java, etc. etc.

Daniel

Thomas Dodd

2003/06/25 11:29:36

To: mozilla-...@mozilla.org

David Spade wrote:
> On 24/06/2003 14:53, Thomas Dodd wrote:
>> ST Dog >>~/j2sdk1.4.2/bin/java -cp .
>> mozilla_training_analyzer.Analyzer -i
>> ~/.mozilla/default/2ji7t0kd.slt/training.dat -f text -o ~/MozJunk.txt
>> The number of good messages processed is 8930
>> The number of bad messages processed is 2530
>> Now processing 245682 good tokens...
>> Now processing 95394 bad tokens...

Is that an unusual number of tokens or messages?

> ~/j2sdk1.4.2/bin/java -Xms64m -Xmx256m -cp .

Is 256m arbitrary? Why would the GUI work but there not be enough memory
to export the file without the GUI?

For what it's worth, those sizes (64m and 256m) worked for text and html
export.

> Out of curiosity, what size is your training.dat? I think mine is about
> an order of magnitude smaller, so I didn't think to put in an
> explanation of JVM switches for extra memory as I didn't have problems
> during testing.

5735910 bytes (5.5 MB). The same file has been built up since the early days
of the filter. It's not working as well as it once did, hence the desire
to use your tool to prune it, without complete retraining.

The text export of that file is 13M and 320k lines long.
I'm hoping that pruning the file will improve the accuracy of the
filter, and help the filter run faster. It has gotten very slow lately.

-Thomas

Thomas Dodd

2003/06/25 11:51:18

To: mozilla-...@mozilla.org, Daniel Greenspan

Daniel Greenspan wrote:

> Micro$oft wrote:
>
> D:\Install>java -cp . mozilla_training_analyzer.Analyze
> 'java' is not recognized as an internal or external command, operable
> program or batch file.

So the JRE is not in your path. I have Sun's j2re1.4.1 on a Windoze box
here. It is in both "WINDOWS" and "Program Files\Java\j2re1.4.1" as
java.exe on the machine.

>
> Daniel Greenspan writes:
> Just illustrating David's point. While I seem to have some form of
> Java available under and Mozilla, and Microsoft's scripting engine
> will run .js files (although I wish it wouldn't 'cos they always
> seem to be either viruses or slow buggy stuff), I cannot work out
> whether I need to install Sun Java, etc. etc.

You probably need a full jre. Not sure what the Mozilla plugin for
Windoze comes with though.

David Spade

2003/06/25 12:02:06

On 25/06/2003 11:29, Thomas Dodd wrote:
>
> David Spade wrote:
>
>>On 24/06/2003 14:53, Thomas Dodd wrote:
>>
>>>ST Dog >>~/j2sdk1.4.2/bin/java -cp .
>>>mozilla_training_analyzer.Analyzer -i
>>>~/.mozilla/default/2ji7t0kd.slt/training.dat -f text -o ~/MozJunk.txt
>>>The number of good messages processed is 8930
>>>The number of bad messages processed is 2530
>>>Now processing 245682 good tokens...
>>>Now processing 95394 bad tokens...
>
> Is that an unusual number of tokens or messages?

Hard for me to say; I have a sample size of 1 (only my own training.dat). :)

>>~/j2sdk1.4.2/bin/java -Xms64m -Xmx256m -cp .
>
> Is 256m arbitrary? Why would the GUI work but there not be enough memory
> to export the file without the GUI?
>
> For what it's worth, those sizes (64m and 256m) worked for text and html
> export.

Absolutely arbitrary - just examples thrown in to show you how to use
the switches.

Hmm, interesting... So are you saying that when you tried to export as
text from the GUI it worked, but from the CLI (Command Line Interface)
it returned an OutOfMemoryError? That's bizarre, because they both call
the same method.

If on the other hand you're saying that the GUI displays the tokens
fine, but the export function bogs down, then that's understandable as
well. When doing parsing, the very first thing that I do is read the
contents of the training file into a data structure (I called it a
TrainingData, if you want to follow along in the source). For the GUI, I
then display that data in the JTable, however when doing output I dump
all of the data into a StringBuffer (way faster than either adding it to
a String or writing directly out to file) and then, at the end, write it
out so there's only one I/O operation (at least from my point of view).
This essentially duplicates the data, and uses a lot more memory than
the GUI does. It may be this duplication of the data that causes your
OutOfMemoryError... I'll have to look into trying to be smart about this
in the code.

>>Out of curiosity, what size is your training.dat? I think mine is about
>>an order of magnitude smaller, so I didn't think to put in an
>>explanation of JVM switches for extra memory as I didn't have problems
>>during testing.
>
> 5735910 bytes (5.5 MB). The same file has been built up since the early days
> of the filter. It's not working as well as it once did, hence the desire
> to use your tool to prune it, without complete retraining.
>
> The text export of that file is 13M and 320k lines long.
> I'm hoping that pruning the file will improve the accuracy of the
> filter, and help the filter run faster. It has gotten very slow lately.

Training file here is 1,232,107 (1.2 megish).

Hmm... My recommendation would be to remove all tokens that have less
than about 20 in each column. That'll take out the vast majority, and
leave you with your prize-winning tokens. As usual though, make backups
first in case my thinking is wrong.
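For anyone who would rather script the pruning than do it by hand, the rule above might look something like the following. It is a sketch under assumptions: it reads "less than about 20 in each column" as "below the threshold in both the good and the bad column", and `TokenPruner`, `TokenCount` and `prune` are hypothetical names, not classes from the tool's source.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenPruner {
    /** Hypothetical row type: one token with its good and bad counts. */
    static final class TokenCount {
        final String token;
        final int good;
        final int bad;

        TokenCount(String token, int good, int bad) {
            this.token = token;
            this.good = good;
            this.bad = bad;
        }
    }

    /**
     * Keeps a token if at least one of its counts reaches the threshold.
     * Tokens that are below the threshold in both columns are dropped.
     */
    static List<TokenCount> prune(List<TokenCount> tokens, int threshold) {
        List<TokenCount> kept = new ArrayList<>();
        for (TokenCount t : tokens) {
            if (t.good >= threshold || t.bad >= threshold) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data, only to show the call.
        List<TokenCount> tokens = new ArrayList<>();
        tokens.add(new TokenCount("the", 500, 450));
        tokens.add(new TokenCount("viagra", 0, 120));
        tokens.add(new TokenCount("xyzzy", 1, 3));
        for (TokenCount t : prune(tokens, 20)) {
            System.out.println(t.token + " good=" + t.good + " bad=" + t.bad);
        }
    }
}
```

The input list is left untouched and a new list is returned, so the original data stays available if the threshold turns out to be too aggressive.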

-=Straxus=-

David Spade

2003/06/25 12:10:37

On 25/06/2003 08:44, Daniel Greenspan wrote:

>>> Matthias Versen wrote:
>>> Is there somewhere a "compiled" java binary because I don't want to
>>> install the sdk.
> >

>> David Spade wrote:
>> That contains all of the class files. Then, in the README you can just
>> skip the compilation instructions and go straight to the running
>> instructions.
>
> Micro$oft wrote:
>
> D:\Install>java -cp . mozilla_training_analyzer.Analyze
> 'java' is not recognized as an internal or external command, operable
> program or batch file.
>
> Daniel Greenspan writes:
> Just illustrating David's point. While I seem to have some form of
> Java available under and Mozilla, and Microsoft's scripting engine
> will run .js files (although I wish it wouldn't 'cos they always
> seem to be either viruses or slow buggy stuff), I cannot work out
> whether I need to install Sun Java, etc. etc.

You inverted mine and Matti's quotes - I've corrected them above.

In my testing, I was using Sun's Java 1.4.1 SDK, however the 1.4.1 JRE
will work just fine as well. In fact, you can use any java 1.4.1
compliant JRE (IBM, Blackdown?) to run this application. If you want,
the Sun JREs are available from http://java.sun.com.

-=Straxus=-

Thomas Dodd

2003/06/25 14:14:19

To: mozilla-...@mozilla.org

David Spade wrote:
> If on the other hand you're saying that the GUI displays the tokens
> fine, but the export function bogs down, then that's understandable as

That's the one.

> then display that data in the JTable, however when doing output I dump
> all of the data into a StringBuffer (way faster than either adding it to

OK, the duplication of the entire file, as a string, is probably the
problem. Notice how much bigger the text version is than the binary .dat
file. If the int->string conversion were done at output only, it'd save a
lot of memory. I would have written each row of the table (one token) to
the file and let the OS and filesystem buffering optimize the writes.

> OutOfMemoryError... I'll have to look into trying to be smart about this
> in the code.

I haven't really looked at the code, nor do I know Java well. For the
text and HTML the structure closely follows the table, so output is
easy. Probably the same for XML. Not a clue about the Mozilla native format.

> Training file here is 1,232,107 (1.2 megish).
>
> Hmm... My recommendation would be to remove all tokens that have less
> than about 20 in each column. That'll take out the vast majority, and
> leave you with your prize-winning tokens. As usual though, make backups
> first in case my thinking is wrong.

I have some pretty strange tokens in the file too. Things like <CTRL-R>
or <ESC>$b$k<ESC>. Not sure where those came from. They are non-printable
and don't show up in the GUI. Perhaps you should check the tokens for
non-printable characters and display them in some code?
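One way to "display them in some code" would be to replace each control character with a backslash-u hex escape before the token reaches the table or the export. The following is only a sketch of that idea, not something the tool does; `TokenEscaper` and `escape` are made-up names, and the choice of escape notation is an assumption.

```java
final class TokenEscaper {
    /**
     * Returns the token with every ISO control character (such as
     * Ctrl-R, 0x12, or ESC, 0x1b) replaced by a backslash, the letter u
     * and four hex digits. All other characters pass through unchanged.
     */
    static String escape(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isISOControl(c)) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Build the ESC $b$k ESC token from the report above.
        char esc = (char) 0x1b;
        System.out.println(escape(esc + "$b$k" + esc));
    }
}
```

With this, the <ESC>$b$k<ESC> token would show up as \u001b$b$k\u001b instead of vanishing. Non-ASCII printable characters are deliberately left alone, since they may be legitimate parts of a token.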

-Thomas

David Spade

2003/06/25 14:48:41

On 25/06/2003 14:14, Thomas Dodd wrote:
>
> David Spade wrote:
>
>>If on the other hand you're saying that the GUI displays the tokens
>>fine, but the export function bogs down, then that's understandable as
>
> That's the one.
>
>>then display that data in the JTable, however when doing output I dump
>>all of the data into a StringBuffer (way faster than either adding it to
>
> OK, the duplication of the entire file, as a string, is probably the
> problem. Notice how much bigger the text version is than the binary .dat
> file. If the int->string conversion were done at output only, it'd save a
> lot of memory. I would have written each row of the table (one token) to
> the file and let the OS and filesystem buffering optimize the writes.

Hmm, in the past Java has been really bad at I/O operations, so I
learned to stay away from them. I/O (and Object creation) tends to be
what causes Java programs to run quite slowly.

I believe that a solution to this problem might be to add a flag to the
program that causes it to write in high I/O mode to save memory. Then,
people that experience OutOfMemory Errors can elect to do the conversion
in high I/O mode with the understanding that it might be a bit slower,
but will work.

Hmm, I'll add that comment to the Bugzilla bug. I'm just setting up a
Mozdev.org project for it (It's not ready yet, so I'm not putting the
name) and I've already flagged this in the Bugzilla database.

>>OutOfMemoryError... I'll have to look into trying to be smart about this
>>in the code.
>
> I haven't really looked at the code, nor do I know Java well. For the
> text and HTML the structure closely follows the table, so output is
> easy. Probably the same for XML. Not a clue about the Mozilla native format.

There are two main ways to do XML in Java (or anywhere else for that
matter): manually, or through an API (like DOM or SAX). I chose manual since
the XML is quite simple, so it has the same output structure as the rest
of the output methods (dump to StringBuffer, then write at end). APIs
are generally a better way to do it, but they are also painfully,
painfully slow.

>>Training file here is 1,232,107 (1.2 megish).
>>
>>Hmm... My recommendation would be to remove all tokens that have less
>>than about 20 in each column. That'll take out the vast majority, and
>>leave you with your prize-winning tokens. As usual though, make backups
>>first in case my thinking is wrong.
>
> I have some pretty strange tokens in the file too. Things like <CTRL-R>
> or <ESC>$b$k<ESC>. Not sure where those came from. They are non-printable
> and don't show up in the GUI. Perhaps you should check the tokens for
> non-printable characters and display them in some code?

Hmm, it's probably gummed conversion of multibyte characters which were
found in the emails. I'll have to look into that. As it was, I had to
specify the XML's output type as ISO-8859-1 because the XML would fail
on import as either UTF-8 or UTF-16.

-=Straxus=-

Thomas Dodd

2003/06/25 17:36:05

To: mozilla-...@mozilla.org

David Spade wrote:
> Hmm, in the past Java has been really bad at I/O operations, so I
> learned to stay away from them. I/O (and Object creation) tends to be
> what causes Java programs to run quite slowly.

OK. As I said, I've never used java much.

> I believe that a solution to this problem might be to add a flag to the
> program that causes it to write in high I/O mode to save memory. Then,
> people that experience OutOfMemory Errors can elect to do the conversion
> in high I/O mode with the understanding that it might be a bit slower,
> but will work.

Perhaps, stick to the current method if the GUI is up, but if the
command line options -f and -o were used (so no GUI), bypass the string
buffer method. In the string buffer method, could you catch the
exception, and suggest using the command line to do the export?

Does the mozilla dat file also use the big string buffer?
I wonder how likely it is to run out of memory?
My dat file was 5.5M, but the text file is 13M and the html is 16M.
I imagine the XML would be even bigger.

> Hmm, it's probably gummed conversion of multibyte characters which were
> found in the emails. I'll have to look into that. As it was, I had to
> specify the XML's output type as ISO-8859-1 because the XML would fail
> on import as either UTF-8 or UTF-16.

They're in the text and HTML files. Once you get the project open I can
post them and the dat file. The GUI shows a bad character (empty
rectangle) but that's on Solaris with j2re1.4.2. Displaying the HTML in
Mozilla just skips the characters regardless of the character coding.

-Thomas

Jeffrey Siegal

2003/06/25 23:37:46

David Spade wrote:
> however when doing output I dump
> all of the data into a StringBuffer (way faster than either adding it to
> a String or writing directly out to file) and then, at the end, write it
> out so there's only one I/O operation (at least from my point of view).

Try wrapping the file in a BufferedOutputStream or BufferedWriter and
writing to that. You'll get most of the performance you get from
building a StringBuffer without having to duplicate the entire data set
in memory.
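Roughly what I have in mind, as a sketch only: I haven't read the tool's source, so `StreamingExport`, `export` and the tab-separated layout are made up for illustration, and the code is written against a current JDK rather than 1.4.1. Each row is converted to text only at the moment it is written, and the BufferedWriter batches the small writes before they reach the file.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

final class StreamingExport {
    /**
     * Writes one line per token to the sink as token, good count and
     * bad count separated by tabs. Nothing but the writer's internal
     * buffer is held in memory beyond the input arrays themselves.
     */
    static void export(String[] tokens, int[] good, int[] bad, Writer sink)
            throws IOException {
        BufferedWriter out = new BufferedWriter(sink);
        for (int i = 0; i < tokens.length; i++) {
            out.write(tokens[i]);
            out.write('\t');
            out.write(Integer.toString(good[i])); // int->string only at output time
            out.write('\t');
            out.write(Integer.toString(bad[i]));
            out.write('\n');
        }
        out.flush(); // push the last partial buffer through to the sink
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample rows; the output path is taken from the command line.
        String path = args.length > 0 ? args[0] : "tokens.txt";
        String[] tokens = {"viagra", "hello"};
        int[] good = {1, 40};
        int[] bad = {35, 2};
        try (Writer file = new FileWriter(path)) {
            export(tokens, good, bad, file);
        }
    }
}
```

BufferedWriter also has a two-argument constructor that takes a buffer size in characters, if the default turns out to be too small.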

David Spade

2003/07/03 1:14:41

On 25/06/2003 17:36, Thomas Dodd wrote:
>
> David Spade wrote:
>
>>Hmm, in the past Java has been really bad at I/O operations, so I
>>learned to stay away from them. I/O (and Object creation) tends to be
>>what causes Java programs to run quite slowly.
>
> OK. As I said, I've never used java much.
>
>>I believe that a solution to this problem might be to add a flag to the
>>program that causes it to write in high I/O mode to save memory. Then,
>>people that experience OutOfMemory Errors can elect to do the conversion
>>in high I/O mode with the understanding that it might be a bit slower,
>>but will work.
>
> Perhaps, stick to the current method if the GUI is up, but if the
> command line options -f and -o were used (so no GUI), bypass the string
> buffer method. In the string buffer method, could you catch the
> exception, and suggest using the command line to do the export?
>
> Does the mozilla dat file also use the big string buffer?
> I wonder how likely it is to run out of memory?
> My dat file was 5.5M, but the text file is 13M and the html is 16M.
> I imagine the XML would be even bigger.

A far better solution to this problem than the high I/O one I thought up
has been proposed, and I'll be using BufferedWriters in the next release
of that tool.

>>Hmm, it's probably gummed conversion of multibyte characters which were
>>found in the emails. I'll have to look into that. As it was, I had to
>>specify the XML's output type as ISO-8859-1 because the XML would fail
>>on import as either UTF-8 or UTF-16.
>
> They're in the text and HTML files. Once you get the project open I can
> post them and the dat file. The GUI shows a bad character (empty
> rectangle) but that's on Solaris with j2re1.4.2. Displaying the HTML in
> Mozilla just skips the characters regardless of the character coding.

Hoping to have the Mozdev project functional by next weekend, but in the
meantime you can add your comments to the Bugzilla bug I filed for this
behaviour: http://mozdev.org/bugs/show_bug.cgi?id=3946

-=Straxus=-

David Spade

2003/07/03 1:16:24

Forgot all about that little class. S'an excellent idea, and I think
I'll do that in the next release, giving it a 64k buffer to write into.

Thanks for the idea!

-=Straxus=-