-viral
I have a bunch of C code we can reuse for fast ranged indexing,
reversing, and copy_to with offsets.
If we can extend this to packed arrays of n-bit data types (especially
n < 16), those who like to sequence genes might be interested, though
of course it is non-trivial.
About copied code: that's right; the main reason was that I couldn't
make BitArray be a StridedArray without touching array.jl. This could
create some ugly maintenance issues, and I think DenseArray may make
sense; if nothing else, to have automatic promotions of BitArrays.
About voodoo: there's only a little really, in nnz. I agree we need
moar of that :)
copy_to with offsets has a partial implementation in _jl_copy_chunks
but that could be generalized (currently it copies the whole source
array to some position in a target array). It was in my todo list,
mainly for speeding up concatenation.
del with Range1 was a nightmare. It could probably be done more
elegantly, too, but at least it works fine.
One more issue, about function naming: I have used the prefix bit- for
BitArray-specific operations, e.g. bitzeros, bitones, bitrand etc. I
think this makes sense, but there's a potential confusion between
bitrand and randbit (I surely do get confused); I didn't have any good
idea about how to solve this naming clash. Also, it's impossible to
distinguish a BitVector from a Vector in the REPL by just showing it.
Ok, pull request issued.
Of course, one can call 'make bitarray' from within test to verify
everything's working. Testing is fairly slow but I tried to be
thorough given the amount of pitfalls which are there, waiting between
the bits.About copied code: that's right; the main reason was that I couldn't
make BitArray be a StridedArray without touching array.jl. This could
create some ugly maintenance issues, and I think DenseArray may make
sense; if nothing else, to have automatic promotions of BitArrays.About voodoo: there's only a little really, in nnz. I agree we need
moar of that :)copy_to with offsets has a partial implementation in _jl_copy_chunks
but that could be generalized (currently it copies the whole source
array to some position in a target array). It was in my todo list,
mainly for speeding up concatenation.del with Range1 was a nightmare. It could probably be done more
elegantly, too, but at least it works fine.One more issue, about function naming: I have used the prefix bit- for
BitArray-specific operations, e.g. bitzeros, bitones, bitrand etc. I
think this makes sense, but there's a potential confusion between
bitrand and randbit (I surely do get confused); I didn't have any good
idea about how to solve this naming clash. Also, it's impossible to
distinguish a BitVector from a Vector in the REPL by just showing it.
Ah yes, the wording of my last sentence was unfortunate to say the
least. It should read as: given the current convention for showing
Vectors, it is impossible to be sure of the element type (e.g. Int64
vs Int32 vs. Integer) and this gets even worse with BitVectors, as one
cannot even see the container type. This makes so that bitrand and
randbit produce indistiguishable outputs in the REPL. So the course of
action would either be (matching Patrick's suggestions):
1) have BitArrays move away from the convention, e.g. by using a
summary, or by prepending something to the square brackets ->
specialized show(::Bitvector)
2) change the convention and use a summary in all cases -> change
show(::AbstractVector)
3) ignore the issue
I personally favour option number 2, which could also show the length
of the vector, just as is done for all other arrays, but I don't know
the reason for the different behaviour, or if there were previous
discussions about that.
One more issue, about function naming: I have used the prefix bit- for BitArray-specific operations, e.g. bitzeros, bitones, bitrand etc. I think this makes sense, but there's a potential confusion between bitrand and randbit (I surely do get confused); I didn't have any good idea about how to solve this naming clash.
Also, it's impossible to distinguish a BitVector from a Vector in the REPL by just showing it.
julia> bitrand(5)5-element BitArray:01101julia> randbit(5)5-element Int64 Array:01010
What if we wait for the time when namespaces will be available
instead, and see how this is going to work out? My perplexity arises
because I suppose it's likely bitarrays will stay out of base and be
used under special circumstances; therefore, removing the current
randbit (and making it an alias to bitrand) is going to force loading
bitarray.jl (perhaps just for that).
Ok, let's wait for bitarray to get into base then before making
randbit be an alias to bitrand (or remove it entierly).
BTW the usage of bitarrays to store Bool values was one of the reasons
I was not 100% sure of BitArray <: AbstractArray{Int} and all its
consequences I mentioned in the first email in this thread.
Maybe we should have BitArrays with type then, possibly restricting
the type to integers? I must admit, I never thought about it before,
but I think it makes sense for dispatch, even if the internal
representation is of course the same.
2) I tried to port all kinds of operations available to Arrays; as a guideline, the result is always a BitArray whenever it is safe to assume that a BitArray can hold the result, while promotion to Array{Int} happens otherwise. For example, given two BitArrays b1 and b2:
~b1 will give a BitArray
> Not quite what I was expecting. It does, of course, work with Bool, but
> then we're restricted to an 8-bit underlying representation.
Sorry but I don't understand this remark about being restricted to an
8-bit underlying representation for Bools (BitArrays always use the
same representation, type is only used for dispatch).
> Would it be
> okay to go back to the old (~) function (currently specialized to Bool),
> which explicitly flipped the underlying bits? Or is the current behavior of
> (~) intentional?
The behaviour of ~ is intentional and mimics what happens with Arrays
of the same type, which is the main principle I followed when
introducing types in BitArrays. Therefore, ~ only flips bits (and
returns a BitArray) with BitArray{Bool}, in all other cases it returns
Arrays. One way to get the Bool's behaviour on other bitarray types is
using reinterpret, e.g. you could define:
flipbits{T}(b::BitArray{T}) =
reinterpret(T, ~reinterpret(Bool, b))
(note: convert would produce the same result but it copies the data,
reinterpret doesn't).
Another option could be xor'ing with a bitarray of ones, but that
would be slower I think.
> I also need logical shift functions (>>>) and (<<). If you (Carlo) haven't
> thought these through, I'll attempt an implementation and generate a pull
> request, if that's okay.
That's a good idea, and I don't currently have plans for it (probably,
circular shift functions would be useful as well).
> implement a Bit type, which is virutually identical to Bool, but which
> semantically allows me to treat a BitArray as an array of bits
> alternatively, referring back to Jeff's comments, extend BitArray (or
> implement separately) a PackedArray type, which would allow arrays of
> (currently non-existent) Uint4, Uint2, Uint1, etc.. Bit could be an alias
> for Uint1.
>
> (~) would also work as I expect for these cases.
>
> Any thoughts on these?
In some sense, Bool already is the type representing a Bit in Julia.
It "just happens" to also be used as a truth value when evaluating
expressions. If you don't care about algebra, you should definitely
use Bool as the type of choice in BitArrays, I don't see any real
problem in treating a BitArray{Bool} as an array of bits (corrrect me
if I'm wrong). Note that you can do things like b[5]=1 even if b is a
BitArray{Bool}, and that when you read out the value you can treat it
like it was 1 even though it prints as "true" (in fact, printing is
the only thing which may be a little annoying with Bools intended as
Bits).
The fact that Ints are the "default" bitarray type is somehow
questionable: they are only the default with bitones, bitzeros and
bitrand, but there are also bitfalses and bittrues which give you Bool
bitarrays. It seems to me that in this way the name of the function
reflects better the actual output (least surprising behaviour); I'd
still like to hear others opinions on that however.
If instead you care about algebra, you may a) use a "true" numeric
type and have it behave as expected, or even b) define you own numeric
type (for example, someone may want to work in modulo 2).
Finally, a flipbits functions can be implemented trivially, I'll
surely add it at some point.
Having a PackedArray type would be great and I was planning to try
implementing it (once bitarrays are settled), but it's non-trivial and
it could take a while to even have some basic functionality.
> In my case, I'm actually trying to represent genetic sequence as an array of
> bits. The current implementation is in C++ and uses 2 bitsets to encode
> values. I'd much rather have a packed array of Uint2 or Uint4.
Now that I think about it, it should be possible to define a type like
type GeneSequence
b1::BitVector
b2::BitVector
end
and use (b1[i],b2[i]) to represent a base in {'A','C','G','T'}. Then,
it would just be a matter of defining a few operators with some very
simple definitions. Its performance shouldn't be horribly worse than
that of contiguous packing of couples of bits: you have to operate on
two bitarrays every time, but the operations are simpler. This could
even be easily generalized to any number of bits.
If you're ok with it I could prototype this quickly, I guess.
>> > I also need logical shift functions (>>>) and (<<). If you (Carlo)
>> > haven't
>> > thought these through, I'll attempt an implementation and generate a
>> > pull
>> > request, if that's okay.
>>
>> That's a good idea, and I don't currently have plans for it (probably,
>> circular shift functions would be useful as well).
type NucleotideSeq <: String
data::BitArray
bits::Int
end
import Base.length, Base.next
length(s::NucleotideSeq) = length(s.data)/s.bits
next(s::NucleotideSeq, i::Int) = NucleotideSeq(s.data[i:(i+s.bits-1)],2),i+s.bits
a=NucleotideSeq(convert(BitArray,[true; false; true; true; false; false; false; true]),2);
import Base.done, Base.isempty
done(s::NucleotideSeq,i) = (i > (length(s.data)-s.bits+1))
isempty(s::NucleotideSeq) = done(s,start(s))
ref(s::NucleotideSeq, i::Int) = next(s,i)[1]
ref(s::NucleotideSeq, i::Integer) = NucleotideSeq(s.data[int(i)],2)
ref(s::NucleotideSeq, x::Real) = NucleotideSeq(s.data[to_index(x)],2)
ref{T<:Integer}(s::NucleotideSeq, r::Range1{T}) = NucleotideSeq(s.data[int(first(r)):int(last(r))],2)
import Base.show, Base.print, Base.write
print(io::IO, s::NucleotideSeq) = for c in s print(convert(Array{Int,1},c.data)) end
write(io::IO, s::NucleotideSeq) = print(io, s)
show(io::IO, s::NucleotideSeq) = print(io,s)
julia> a # I'm thinking on a Dict for printing as Char
#[1, 0][1, 1][0, 0][0, 1]