
Recognising file type (ascii/binary)


Bruce Lee

Oct 27, 2005, 11:05:19 AM
Is there any easy way to get Java to determine whether a file is a binary
file or plain text ascii file?


Matt Humphrey

Oct 27, 2005, 6:01:10 PM

"Bruce Lee" <bl...@blahbllbllahblah.com> wrote in message
news:P068f.23301$Ih5....@fe1.news.blueyonder.co.uk...

> Is there any easy way to get Java to determine whether a file is a binary
> file or plain text ascii file?

Files are simply sequences of (binary) bytes--there's no way to tell whether
a file is supposed to contain only bytes that represent printable ASCII (or
Unicode) or some particular binary pattern. You can read the file to find
out--if you find values that signify unlikely or non-printable characters
you can deem the file binary or corrupt. Similarly, there are heuristics
(based on convention) for guessing the "type" of the file based on the first
few bytes, but there's no guarantee these are correct either. (And files
with 2-byte Unicode characters can really confuse things.)
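
As a rough illustration of the "first few bytes" heuristic, here is a minimal sketch that compares the start of a file against a handful of well-known signatures. The class name MagicSniffer and the method guess() are made up for this example, and the signature list is deliberately tiny; a match is only a hint, for the reasons given above.

```java
import java.util.Arrays;

// Hypothetical helper: guesses a file type from its leading bytes.
final class MagicSniffer {
    // A few widely documented signatures; real tools know many more.
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    private static final byte[] GIF = {'G', 'I', 'F', '8'};

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Returns a guessed type name, or null if no known signature matched. */
    static String guess(byte[] head) {
        if (startsWith(head, PNG)) return "png";
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, ZIP)) return "zip";
        if (startsWith(head, GIF)) return "gif";
        return null; // unknown: could be text, could be anything else
    }
}
```

A null result says nothing either way: plain text has no signature, and neither do many binary formats.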

Of course, you could require that text files end in "txt" or something--it's
no worse than any of the above and significantly easier.

What are you trying to do?

Cheers,


Oliver Wong

Oct 28, 2005, 4:15:59 PM

"Matt Humphrey" <ma...@ivizNOSPAM.com> wrote in message
news:XpSdneq4Oe9...@adelphia.com...

Matt Humphrey is completely correct. However, as an additional check on
the heuristic of looking for unprintable characters, another trick is to
check whether the newline string is consistent. It should always be either
"\n" (for UNIX-like systems), "\r" (for classic Mac OS) or "\r\n" (for
Windows-like systems). If the file switches between these, it probably
isn't a valid ASCII file on any of the above three platforms.
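
A minimal sketch of that newline check, assuming the sample is already in a byte array. The names NewlineCheck and hasConsistentNewlines are invented for the example.

```java
// Hypothetical helper: checks that only one newline convention appears.
final class NewlineCheck {
    /** Returns true if the data uses at most one of "\n", "\r" and "\r\n". */
    static boolean hasConsistentNewlines(byte[] data) {
        boolean lf = false;
        boolean cr = false;
        boolean crlf = false;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    crlf = true;
                    i++; // skip the '\n' that belongs to this pair
                } else {
                    cr = true; // a lone carriage return
                }
            } else if (data[i] == '\n') {
                lf = true; // a lone line feed
            }
        }
        int styles = (lf ? 1 : 0) + (cr ? 1 : 0) + (crlf ? 1 : 0);
        return styles <= 1;
    }
}
```

One caveat with this sketch: if the sample is cut off between the "\r" and the "\n" of a pair, the trailing "\r" is counted as a lone carriage return, so it is safer to ignore the last byte of a truncated sample.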

You could also disregard 2-byte UNICODE characters as being "non-ASCII",
and lump them in with the category of "binary files".

- Oliver


Bruce Lee

Oct 30, 2005, 12:24:01 AM

"Matt Humphrey" <ma...@ivizNOSPAM.com> wrote in message
news:XpSdneq4Oe9...@adelphia.com...
> What are you trying to do?

To see if a url is binary or not without relying on the header.

I'm using something like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

protected boolean isBinary(String url){

    boolean isbin=false;
    java.io.InputStream in=null;

    try{
        URL bin_url = new URL(url);

        in = bin_url.openStream();
        //note: this decodes the bytes using the platform default charset
        BufferedReader r = new BufferedReader(new InputStreamReader(in));

        char [] cc= new char[255]; //do a peek
        int n = r.read(cc,0,255); //may be fewer than 255, or -1 if empty

        double prob_bin=0;

        for(int i=0; i<n; i++){
            int j = (int)cc[i];

            //with Chinese and other such languages it might flag them
            //as binary - ideally needs another check
            if(j<32 || j>127){
                prob_bin++;
            }
        }

        if(n>0){
            double pb = prob_bin/n; //fraction of the chars actually read
            if(pb>0.5){
                // System.out.println("probably binary at "+pb);
                isbin= true;
            }
        }

    }catch(Exception ee){
        System.out.println("WARN! Couldn't find isBinary() content-"+url);
        isbin= false; //error - likely broken link - so return false
    }finally{
        //close the stream exactly once, whether or not the read worked
        try{
            if(in!=null) in.close();
        }catch(Exception e){}
    }

    System.out.println("url isBinary():"+url+":"+isbin);
    return isbin;

}

I read somewhere that finding \n's might work as well.

Also, are ASCII 7bit and binary 8bit or something? Is there a way to find
this out - like analyse a byte?


Roedy Green

Oct 30, 2005, 1:53:09 AM
On Thu, 27 Oct 2005 15:05:19 GMT, "Bruce Lee"
<bl...@blahbllbllahblah.com> wrote, quoted or indirectly quoted someone
who said :

>Is there any easy way to get Java to determine whether a file is a binary
>file or plain text ascii file?

A practical test is to scan the first N bytes for a 0. If you find
one it is a binary, if not text.
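
A minimal sketch of that zero-byte scan, assuming the first bytes have already been read into an array. The names NulScan and looksBinary are invented for the example.

```java
// Hypothetical helper: treats any zero byte in the first n bytes as "binary".
final class NulScan {
    /** Returns true if a 0 byte occurs within the first n bytes of data. */
    static boolean looksBinary(byte[] data, int n) {
        int limit = Math.min(n, data.length);
        for (int i = 0; i < limit; i++) {
            if (data[i] == 0) {
                return true;
            }
        }
        return false;
    }
}
```

Note that text stored as UTF-16 is full of zero bytes for ordinary Latin characters, so this test would call it binary.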

It actually becomes a judgment call.

Let us say you define a text file as containing only 7-bit ASCII, no
control chars but \t space \n \r.

Then you find an 0x01 char somewhere in the file. Does that make it a
binary format?
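
The strict definition above can be tested mechanically, even though what to do about a single stray byte stays a judgment call. Here is a minimal sketch; the names StrictAscii, isAllowed and firstViolation are invented for the example, and DEL (0x7F) is treated as a control character.

```java
// Hypothetical helper: applies the strict "7-bit ASCII text" definition.
final class StrictAscii {
    /** True for tab, LF, CR, and the printable range 0x20 (space) to 0x7E. */
    static boolean isAllowed(byte b) {
        int v = b & 0xFF; // view the byte as unsigned, 0..255
        if (v == '\t' || v == '\n' || v == '\r') {
            return true;
        }
        return v >= 0x20 && v <= 0x7E;
    }

    /** Returns the index of the first disallowed byte, or -1 if there is none. */
    static int firstViolation(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (!isAllowed(data[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

Returning the position rather than a plain yes/no leaves the caller free to decide whether one odd byte deep in the file matters.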

Unfortunately not all OS's track the format/MIME etc of each file.
There is no universal scheme of embedded id signatures. It is a mess.
You have to do something seat of the pants yourself.

You can't even tell which encoding is used for a pure text file.
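
A small illustration of that point: the same two bytes decode without error under two different encodings and give different text. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: one byte sequence, two equally "valid" readings.
final class EncodingAmbiguity {
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        // As UTF-8 this is the single character U+00E9 (e with acute accent).
        System.out.println(decodeUtf8(bytes).length());   // 1
        // As ISO-8859-1 it is two characters, U+00C3 followed by U+00A9.
        System.out.println(decodeLatin1(bytes).length()); // 2
    }
}
```

Nothing in the bytes themselves says which reading was intended.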

--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Oliver Wong

Oct 31, 2005, 1:26:03 PM

"Bruce Lee" <bl...@blahbllbllahblah.com> wrote in message
news:BVX8f.8492$m%6.2...@fe3.news.blueyonder.co.uk...

>
> Also, are ASCII 7bit and binary 8bit or something?

There is no "bit length" associated with the concept of "binary". The
question is equivalent to "Is decimal 5 digits long or 7 digits long?" A
number written in decimal can be any number of digits long.

> Is there a way to find
> this out - like analyse a byte?

This is reminiscent of a discussion Roedy and I had about ASCII versus
binary formats. My position was that all data stored on a computer is stored
in binary (i.e. it is stored using bits), and one form of binary encoding
is called "ASCII". It was a poor choice of wording to use "binary" to
mean "non-ASCII".

I'm assuming you don't directly care whether a given bitstream is ASCII
or non-ASCII; rather, you want this information so that you can solve
another problem. What is the real problem you are trying to solve? Perhaps
we can offer you solutions that don't involve distinguishing between ASCII
and non-ASCII bitstreams.

- Oliver

* The reason you may want to avoid distinguishing ASCII and non-ASCII
bitstreams is that in general, it is completely impossible. There may exist
binary file formats out there which, given appropriate data to represent,
yield bits which can legally be decoded into only printable characters using
the ASCII table, even though the semantic information in the file was never
meant to be text.
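
To make the footnote concrete, here is a minimal sketch of a made-up "binary format" (just a sequence of big-endian 32-bit integers) whose output happens to be entirely printable ASCII for certain values. The names PrintableBinary and encodeInts are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: binary integer data that happens to look like text.
final class PrintableBinary {
    /** Serialises each int as 4 bytes, most significant byte first. */
    static byte[] encodeInts(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(4 * values.length); // big-endian by default
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        // Two integers, stored as numbers, not as text.
        byte[] data = encodeInts(0x48656C6C, 0x6F212121);
        // Every byte is printable ASCII, so a byte-level check sees "text".
        System.out.println(new String(data, StandardCharsets.US_ASCII)); // Hello!!!
    }
}
```

No inspection of the bytes alone can tell that these eight bytes were meant as two numbers.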

