Is there a Char type of 1 byte (8 bits) in Julia?


Diego Javier Zea

Dec 15, 2012, 5:14:02 PM
to julia...@googlegroups.com
Julia has a 32-bit Char type. If you work with a huge array of Chars (ASCII, not Unicode)... would an 8-bit Char be more efficient (in both speed and memory)? Is there a 1-byte Char in Julia? Thanks :)

Stefan Karpinski

Dec 15, 2012, 8:14:16 PM
to julia...@googlegroups.com
You don't generally work with arrays of Chars; you work with String objects, which encode strings as bytes using, e.g., UTF-8.



Keno Fischer

Dec 15, 2012, 5:17:41 PM
to julia...@googlegroups.com
Int8 and Uint8

Diego Javier Zea

Dec 16, 2012, 5:01:36 PM
to julia...@googlegroups.com
It depends. In bioinformatics it's common, and very useful, to do operations over arrays of Chars. For example, loading big alignments (of a huge number of sequences) and doing some calculation is more efficient on matrices than on lists of strings. In MATLAB, sequences "can also be a cell array of strings or a char array" - http://www.mathworks.com/help/bioinfo/ref/multialign.html - but it's more efficient to work on char arrays when multiple sequences are manipulated - http://stackoverflow.com/questions/13552916/numpy-and-biopython-must-be-integrated -. At the moment we are going to use Julia's Char definition for a Bio module [ https://groups.google.com/forum/?hl=es&fromgroups=#!topic/julia-dev/Ofm2QoALIuA ], because a matrix of Int8 isn't human-readable. But a 1-byte Char could be a better, more efficient definition for this kind of object. Thanks!!

Keno Fischer

Dec 16, 2012, 5:20:17 PM
to julia...@googlegroups.com
How about a matrix of Uint8's with a custom show method? Alternatively,
a custom bitstype, but then you'd have to redefine all the operations
that you want to do (I don't know what your area of application is, so
I can't say which is better).

Diego Javier Zea

Dec 16, 2012, 6:51:27 PM
to julia...@googlegroups.com
I'm thinking of something similar using conversions. I'm going to test it... But I really think that splitting Char into a Char8 and a Char32 (with Char as an alias for compatibility) could be more useful, powerful, and general, don't you think? ;) [ Now is the time for these kinds of changes, before it's too late. ]

Stefan Karpinski

Dec 16, 2012, 11:38:30 PM
to Julia Users
Having both Char8 and Char32 adds a huge amount of complication for no benefit. The main reason it doesn't buy you anything is that general-purpose registers are no smaller than 32 bits, so LLVM would expand a Char8 to 32 bits to do any work anyway. The only real use case for bits types smaller than 32 bits is compact storage, which the ASCIIString and UTF8String types already provide without introducing more Char types.

In other languages, co-opting strings for bioinformatics data may be the best approach, but for Julia, that's definitely not what I would recommend. Instead, I strongly suggest creating custom bits types and either working with arrays of them or making some custom sequence container. (There may be a case for a Stringlike abstraction above String, to hold generic code that applies equally well to sequences of things like DNA base pairs and to strings of characters.)

You can, for example, define a 4-bit BasePair type and have a BasePairArray that uses only 4 bits per pair (like how BitArray packs 8 values into each byte). You can even have a string-literal representation for these arrays – dna"CGAATAACG". While Keno is quite right that you would have to define operations on BasePair objects, that seems reasonable – I can't see why integer arithmetic makes sense for BasePairs. It might also be useful to have a Codon type or an AminoAcid type and work with sequences of those. Much of the sequence manipulation logic and code can be shared between these types.

One of the design features of the language is that you don't have to try to cram stuff like this into built in system types to get performance or features: you can do all of this in pure user-land Julia code.
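Stefan's BasePair-plus-string-literal idea might look like this in modern Julia syntax (where `bitstype 8 T` is now `primitive type T 8 end`; sub-byte sizes are still unsupported, so this sketch uses 8 bits rather than 4, and all names are illustrative):

```julia
# An 8-bit bits type holding one ASCII base-pair code.
primitive type BasePair 8 end

BasePair(c::Char) = reinterpret(BasePair, UInt8(c))     # ASCII codes only
Base.Char(bp::BasePair) = Char(reinterpret(UInt8, bp))
Base.show(io::IO, bp::BasePair) = print(io, Char(bp))   # human-readable

# String-literal syntax: dna"CGAT" produces a Vector{BasePair}.
macro dna_str(s)
    [BasePair(c) for c in s]
end

seq = dna"CGAATAACG"
```

Integer arithmetic is deliberately left undefined on BasePair; only the operations that make biological sense would be added.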



Jeff Bezanson

Dec 17, 2012, 1:17:06 AM
to julia...@googlegroups.com
We don't yet support bits types whose sizes aren't multiples of 8. If
we did, we could have "bitstype 1 Bool" and BitArray would not have
been necessary :)

But you can feel free to define "bitstype 8 Char8", and copy char.jl
for ideas of what to implement, though you probably wouldn't need most
of it.
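In today's syntax, Jeff's `bitstype 8 Char8` would be spelled as below; this is a sketch assuming only code points up to 255 are stored, with just the handful of char.jl-style operations (conversion, comparison, display) one would probably want:

```julia
primitive type Char8 8 end            # modern spelling of `bitstype 8 Char8`

Char8(c::Char) = reinterpret(Char8, UInt8(c))   # errors for code points > 255
Base.Char(c::Char8) = Char(reinterpret(UInt8, c))
Base.isless(a::Char8, b::Char8) = reinterpret(UInt8, a) < reinterpret(UInt8, b)
Base.show(io::IO, c::Char8) = show(io, Char(c))

@assert sizeof(Char8('A')) == 1       # one byte, versus 4 for a Char
```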

Stefan Karpinski

Dec 17, 2012, 1:26:09 AM
to Julia Users
Right, BasePair would have to be of size 8, but otherwise everything goes through.



Stefan Karpinski

Dec 17, 2012, 3:10:54 PM
to julia...@googlegroups.com
I would strongly encourage not co-opting strings for this. In other languages this may be the best option, but in Julia you can easily create a human-readable BasePair type and work efficiently with arrays of those. Keep in mind that Julia's Char and String types are just user-defined types that happen to have literal syntaxes. You can even create your own string-literal syntax for DNA using prefixed strings (see the manual), so that dna"CGATTACAA" would produce an array of BasePairs, or whatever the best representation turns out to be. Since these things are stringlike, we may want to factor out an abstraction above String that lets Unicode strings and stringlike objects for bioinformatics share appropriate pieces of generic logic.

Some of the obvious possible benefits of not representing bioinformatics data with Unicode strings include: being able to ignore all the complications of variable-width encoding (e.g. UTF-8); being able to use only 4 bits per base pair; maybe allowing the representation of DNA to be mutable, unlike strings.

Diego Javier Zea

Dec 19, 2012, 10:40:53 AM
to julia...@googlegroups.com
Hi!!! Many thanks for all the advice :)
I'm not able to use only 4 bits per nucleic acid, because the IUPAC code (including gaps) has 18 characters (and because Julia doesn't support fewer than 8 bits).
An 8-bit representation of Sequence objects looks like the better option.

I read the manual, and creating types is easy in Julia, but maybe creating too many types could be annoying for future users.

This is a personal opinion on usability:
Sometimes using core features and types makes things easier to learn and use, because you can reuse your knowledge.
One example: Numpy/Scipy types are great, but I know one person who explicitly refused to learn and use Scipy/Numpy because it is an entire world inside Python, and they got bored with the overly large documentation.
Making things easy to use is a good feature.
At the same time, it makes it easier to use the other tools of the language.
Sometimes it's easy to convert one type to another (and to create new methods from other methods in Julia). But changing from one object to another comes with some penalties. (I haven't measured the penalties of converting types in Julia, but I remember the Biopython-to-numpy performance penalties as one of the starting points for looking at alternatives like Julia.)
[ Creating new methods from other methods in Julia is easy... but I'm afraid it may become more difficult to maintain or update, because you need to know about the new methods - and with a growing library that can be very difficult. ]

Back to types in Bio:

I'm thinking that my first idea - a composite type for Sequences (nucleic and amino acids) that loads ids and annotations in the same object, and another for Alignments, based on an 8-bit Char and ASCIIStrings (with optional methods for checking the IUPAC code) - could be useful. [I'm not sure whether to create an 8-bit Seq character, plus arrays and strings of it, or not. Basically it would be almost the same as ASCIIString, so maybe using the base String is better.] Composite types would have a look & feel like other Bio* projects, and Sequences would behave the same as base arrays and strings (easy to learn once you have learned Julia). [ These two things would be useful for people moving to Julia: a lot of people working in bioinformatics come from the biological sciences, learn to program on the fly, and usually don't change language after learning one - the popularity of BioPerl, for example, shows this. Having easy-to-learn, coherent, and obvious behavior is a good and important feature in this field. ]

Char8:
I know that a Char8 isn't going to be faster, but it will take less memory. And maybe a Char8 could be the expected return of chars() [ a chars method over an ASCIIString (8 bits) returning an array of 8-bit Chars (ASCII characters) ].

I'm not sure whether to create an 8-bit Seq character (plus arrays and strings of it) or whether it's better to create a Char8. Basically, a Seq type inside a composite type would be almost the same as ASCIIString and Char (but with 8 bits). What do you think?
Is it really not a good idea for Base to have a Char8?


Thanks

Stefan Karpinski

Dec 19, 2012, 11:53:25 AM
to Julia Users
On Wed, Dec 19, 2012 at 10:40 AM, Diego Javier Zea <dieg...@gmail.com> wrote:
I know that a Char8 isn't going to be faster, but it will take less memory. And maybe a Char8 could be the expected return of chars() [ a chars method over an ASCIIString (8 bits) returning an array of 8-bit Chars (ASCII characters) ].

I'm not sure whether to create an 8-bit Seq character (plus arrays and strings of it) or whether it's better to create a Char8. Basically, a Seq type inside a composite type would be almost the same as ASCIIString and Char (but with 8 bits). What do you think?
Is it really not a good idea for Base to have a Char8?

Strings of Char8s would use exactly the same space per ASCII character as UTF8Strings, because UTF-8 uses a single byte for ASCII characters. If you're confused about UTF-8 and Unicode in general, you're certainly not alone, and I would recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. The section on UTF-8 and how it works may help clarify things.

The bottom line is that DNA sequences are not strings and they're definitely not Unicode strings. The fact that they've been crammed into strings in Perl, Python, etc. is merely a historical artifact of those languages not being able to define new, appropriate data types for representing DNA. It becomes increasingly awkward and awful to use strings to represent DNA as those languages improve their Unicode support.
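Stefan's point about UTF-8 can be checked directly (modern Julia, where `String` is UTF-8 encoded):

```julia
s = "CGAATAACG"                        # ASCII-only string
@assert sizeof(s) == length(s) == 9    # 1 byte per ASCII character in UTF-8
@assert sizeof("é") == 2               # non-ASCII code points take more bytes
```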

Kevin Squire

Dec 19, 2012, 12:36:46 PM
to julia...@googlegroups.com
Hi Diego,

I've been meaning to speak up for a while, but have been distracted by other things.  I also work with sequencing data in bioinformatics (and am a big fan of Julia).  And I'm also interested in working on a (set of?) bioinformatics package(s) for Julia, though I haven't done much there yet.

It may not be relevant right now, but if you count {U,T} as equivalent and {".","-"} as equivalent, there are really only 16 IUPAC codes for nucleotides.  For very large datasets, it may be worthwhile to have a packed format which encodes two nucleotides into one byte (and/or a 2-bit format as well, for ATCG).  It should also be possible to encapsulate BitArrays to do this, although I'm not sure how good the performance would be.  For smaller datasets, one byte per nucleotide position should be fine.
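Kevin's packing idea can be sketched as below (the 4-bit code table and helper names are made up for illustration; a real implementation would cover all 16 IUPAC codes):

```julia
# 4-bit codes for the unambiguous bases (illustrative subset).
const NUC4 = Dict('A' => 0x1, 'C' => 0x2, 'G' => 0x4, 'T' => 0x8)

# Pack two 4-bit codes per byte, padding odd-length sequences with 0.
function pack(seq::AbstractString)
    vals = [NUC4[c] for c in seq]
    isodd(length(vals)) && push!(vals, 0x0)
    [vals[i] << 4 | vals[i+1] for i in 1:2:length(vals)]
end

unpack_hi(b::UInt8) = b >> 4      # first nucleotide stored in the byte
unpack_lo(b::UInt8) = b & 0x0f    # second nucleotide stored in the byte
```

Using one bit per base (A=1, C=2, G=4, T=8) also makes ambiguity codes natural later, since they become bitwise ORs of the base codes.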

You might also be interested in Carlo Baldassi's julia-fastaread (https://github.com/carlobaldassi/julia-fastaread), which he hinted that he might turn into a package at some point.  At this point, like everyone else, it treats nucleotide sequences like strings. ;-)

I likely won't be available much until January, but I'm quite interested in working on a package or two related to sequencing as well.

Kevin

Diego Javier Zea

Dec 19, 2012, 6:10:57 PM
to julia...@googlegroups.com
Hi Kevin! It's good to hear this :D Actually we are only two people (supported by our professors) trying to do this, but more is better!!! If you want, you can write to me at dieg...@gmail.com ;) Carlo's work is interesting; I'm going to write to him.

Back to bits: we cannot use fewer than 8 bits, but it's true that it's possible to encode more than one unit in a byte ;) I really don't know if it is better or not, but we can test it.

At the moment, I think an 8-bit definition is a good way to get started.

Best