How to locate the bit in bits string?

Li Wang

unread,

Apr 28, 2009, 10:36:56 AM4/28/09

to pytho...@python.org

Hi:

If I use an integer to represent bits:
e.g. 99 represents '1100011'

How can I locate, say the second bit of 99(i.e. '1')?

Although bin(99)[4] could be used to locate it, this transform cost
too much memory (99 only needs 2Bytes, while string '1100011' needs
7Bytes).

Anyone knows how to locate the second bit without using bin() function?

Thank you very much:D

--
Li
------
Time is all we have
and you may find one day
you have less than you think

Tim Chase

unread,

Apr 28, 2009, 11:03:21 AM4/28/09

to Li Wang, pytho...@python.org

Li Wang wrote:
> Hi:
>
> If I use an integer to represent bits:
> e.g. 99 represents '1100011'
>
> How can I locate, say the second bit of 99(i.e. '1')?
>
> Although bin(99)[4] could be used to locate it, this transform cost
> too much memory (99 only needs 2Bytes, while string '1100011' needs
> 7Bytes).
>
> Anyone knows how to locate the second bit without using bin() function?

You mean

def get_bit(number, bit):
return (number >> bit) & 1

?

-tkc

tuxagb

unread,

Apr 28, 2009, 11:17:38 AM4/28/09

to

If we consider 8 bit, a solution may be the follow:

>>> a = 99 # bin(a) = 0b1100011
>>> 0b11111101 & a
97
>>> 0b11101111 & a
99

as you view, you set the nth bit to 0, and view if the result is same
as 'a'. If is same
then the nth bit doesn't set. If you write the above code in a loop,
you can test which bit
you want.

Hi.

Li Wang

unread,

Apr 28, 2009, 11:24:02 AM4/28/09

to Tim Chase, pytho...@python.org

2009/4/29 Tim Chase <pytho...@tim.thechases.com>:

> Li Wang wrote:
>>
>> Hi:
>>
>> If I use an integer to represent bits:
>> e.g. 99 represents '1100011'
>>
>> How can I locate, say the second bit of 99(i.e. '1')?
>>
>> Although bin(99)[4] could be used to locate it, this transform cost
>> too much memory (99 only needs 2Bytes, while string '1100011' needs
>> 7Bytes).
>>
>> Anyone knows how to locate the second bit without using bin() function?
>

> You mean
>
> def get_bit(number, bit):
> return (number >> bit) & 1
>
> ?
>

Hummm, I have tried this method too, the problem is its time
complexity. If the length of my bits is n, then the time complexity is
O(n). When I tried to implement this in practice, it did consume a lot
of time.

So do you know how could I locate the bit in O(1) time? Transform it
into a string is a method, but takes too much space (when I try to
process a 2M file, it used more than 100M memory.....).

Thank you very much.

> -tkc

tuxagb

unread,

Apr 28, 2009, 11:28:17 AM4/28/09

to

On 28 Apr, 17:24, Li Wang <li.wan...@gmail.com> wrote:
> 2009/4/29 Tim Chase <python.l...@tim.thechases.com>:

The my solution is good, but the Tim's solution is better, and it is O
(1) in time and space.
What is you search? I dont'know you general problem, but search the
value of a bit in a 2M file ...
is strange .....

Hi.

Tim Chase

unread,

Apr 28, 2009, 12:11:39 PM4/28/09

to Li Wang, pytho...@python.org

Li Wang wrote:
> 2009/4/29 Tim Chase <pytho...@tim.thechases.com>:
>> Li Wang wrote:

>>> If I use an integer to represent bits:

[snip]

> Hummm, I have tried this method too, the problem is its time
> complexity. If the length of my bits is n, then the time complexity is
> O(n). When I tried to implement this in practice, it did consume a lot
> of time.

I'm not sure I follow here -- your original post said that you
have an integer. With an integer, my function is O(1) as you
request. It sounds like you're *not* representing your bits as
an integer. Either that, or it's a gargantuan number (say, more
than 64 bits? Maybe more than 1k bits?) that you want to
bit-slice. In that case, you need to know a bit more about the
storage structure. However, assuming it's a raw bit-stream, you
might be able to do something like this (untested, but should be
fairly close)

data = file('source.bin').read()
def get_bit(source, bit):
idx, bit = divmod(bit, 8)
byte = ord(source[len(source) - (1+idx)])
return (byte >> bit) & 1

print get_bit(data, 3141592) # the 3,141,592nd bit

you might have to tweak for endian'ness.

-tkc

Tim Chase

unread,

Apr 28, 2009, 12:49:43 PM4/28/09

to Li Wang, pytho...@python.org

>> data = file('source.bin').read()
>> def get_bit(source, bit):
>> idx, bit = divmod(bit, 8)
>> byte = ord(source[len(source) - (1+idx)])
>> return (byte >> bit) & 1
>

> My understanding is: when doing this step, every bit in the byte will
> be shifted bit-long. If it is get_bit(data, 100), and the source.bin
> has 200bits in it. this single step (i.e. return (byte >> bit) & 1)
> will take 200 + 199 + 198 + ... + 101 times bit shifting operation.
> That's my understanding of the this function.

The function extracts the byte that contains the bit of interest,
and then extracts the corresponding bit from that byte. More
verbosely:

idx, bit = divmod(bit, 8)

# idx is now the Nth byte from the end we want
# bit is now 0-7 for the bit inside the corresponding byte

offset = len(source) - (1+idx)
# convert the left-hand idx to a right-hand idx

byte = ord(source[offset]))
# we now have the byte containing the bit we want

# my original O(1) bit-extract

return (byte >> bit) & 1

it only shifts once because it can precisely target the exact
byte without having to visit the other bytes. Trust me...it's
O(1) for extracting a single bit (unless indexed access to "data"
is not O(1), but that would assume something other than a string
or list data-type...like a data-stream or generator).

-tkc

Tim Chase

unread,

Apr 28, 2009, 2:44:13 PM4/28/09

to Li Wang, pytho...@python.org

> I want to concatenate two bits string together: say we have '1001' and
> '111' which are represented in integer. I want to concatenate them to
> '1001111' (also in integer form), my method is:
> ('1001' << 3) | 111
> which is very time consuming.

You omit some key details -- namely how do you know that "1001"
is 4 bits and not "00001001" (8-bits)? If it's a string (as your
current code shows), you can determine the length. However, if
they are actually ints, your code should work fine & be O(1).

This can be abstracted if you need:

def combine_bits(int_a, int_b, bit_len_b):
return (int_a << bit_len_b) | int_b

a = 0x09
b = 0x07
print combine_bits(a, b, 3)

However, if you're using gargantuan ints (as discussed before),
it's a lot messier. You'd have to clarify the storage structure
(a byte string? a python long?)

-tkc

PS: You may want to CC the mailing list so that others get a
crack at answering your questions...I've been adding it back in,
but you've been replying just to me.

David Smith

unread,

Apr 28, 2009, 3:22:22 PM4/28/09

to

So... I can only conclude you are looking for bit x in the entirety of a
file. First you'll have to figure out what byte to look at w/ a little
integer division, then read to that point and test for the specific bit
-- I'm thinking a bitwise and operation with a bit mask. Should be
really fast.

--David

Li Wang

unread,

Apr 28, 2009, 8:39:12 PM4/28/09

to Tim Chase, pytho...@python.org

2009/4/29 Tim Chase <pytho...@tim.thechases.com>:

>> I want to concatenate two bits string together: say we have '1001' and
>> '111' which are represented in integer. I want to concatenate them to
>> '1001111' (also in integer form), my method is:
>> ('1001' << 3) | 111
>> which is very time consuming.
>
> You omit some key details -- namely how do you know that "1001" is 4 bits
> and not "00001001" (8-bits)? If it's a string (as your current code shows),
> you can determine the length. However, if they are actually ints, your code
> should work fine & be O(1).

Actually, what I have is a list of integer numbers [3,55,99,44], and
by using Huffman coding or fixed length coding, I will know how the
bits-length for each number. When I try to concatenate them (say
10,000 items in the list) all together, the speed is going down
quickly (because of the shifting operations of python long).

>
> This can be abstracted if you need:
>
> def combine_bits(int_a, int_b, bit_len_b):
> return (int_a << bit_len_b) | int_b
>
> a = 0x09
> b = 0x07
> print combine_bits(a, b, 3)
>
> However, if you're using gargantuan ints (as discussed before), it's a lot
> messier. You'd have to clarify the storage structure (a byte string? a
> python long?)

I am using a single python long to store all the items in the list
(say, 10,000 items), so the work does become messier...

>
> -tkc
>
> PS: You may want to CC the mailing list so that others get a crack at
> answering your questions...I've been adding it back in, but you've been
> replying just to me.

Sorry, this is the first time I am using mail-list....and always
forgot "reply to all"

Thank you very much:D

>
>
>
>

--

Tim Chase

unread,

Apr 28, 2009, 9:34:07 PM4/28/09

to Li Wang, pytho...@python.org

>> You omit some key details -- namely how do you know that
>> "1001" is 4 bits and not "00001001" (8-bits)? If it's a
>> string (as your current code shows), you can determine the
>> length. However, if they are actually ints, your code
>> should work fine & be O(1).
>
> Actually, what I have is a list of integer numbers
> [3,55,99,44], and by using Huffman coding or fixed length
> coding, I will know how the bits-length for each number. When
> I try to concatenate them (say 10,000 items in the list) all
> together, the speed is going do

Ah, this makes more sense!

I'd try creating a bitwriter object that takes fragments of bytes
and writes them sequentially to a file-like object. Something
like this untested class:

class BitWriter:
def __init__(self, outstream):
self.out = outstream
self.datum = 0 # the data accrued
self.bits = 0 # the number of bits we've accrued
def write(self, byte_data, bit_count):
# maybe add some code to ensure
# that byte_data doesn't have any
# significant bits beyond bit_count
datum = (self.datum << bit_count) | byte_data
self.bits += bit_count
if self.bits >= 8:
overflow = self.bits - 8
self.out.write(chr(datum >> overflow))
#discard the stuff we wrote
datum -= (datum >> overflow) << overflow
self.bits -= 8
def close(self):
if self.bits: # we have unflushed data
data = self.datum << (8 - self.bits)
self.out.write(chr(data))
self.out.close()

out = file('compressed.dat', 'wb')
bw = BitWriter(out)
for data, bit_count in source():
out.write(data, bit_count)
out.close()

As mentioned, it's 100% untested, I don't know what sort of
performance it has, nor the endian'ness of your data streams, so
you might have to twiddle bits in the opposite directions and
test the performance.

-tkc

Li Wang

unread,

Apr 28, 2009, 11:09:24 PM4/28/09

to Tim Chase, pytho...@python.org

2009/4/29 Tim Chase <pytho...@tim.thechases.com>:

The method looks great, I thought I got your idea: when the
len(bits)>8, we output 8bits and discard them. I believe this will
work efficiently, because I have implemented the algorithm by
manipulating strings ('111001' character string, not bits) in the
similar way, and worked quite fast.:), Thank you very much.

You mentioned "twiddle bits in the opposite directions", that reminds
me of another possible solutions for this problem. Say, I have bits
'1101' and '110', instead of doing (1101<<3) |110, we could try the
opposite way:
firstly, inverse the '1101' to '1011', and '110' to '011'
then, (011 << 4) | 1011,
if there comes another bit '10100'
inverse it again 10100 to 00101
add it to the long bits: (00101 << 7) | 0111011
so on and so forth.

After all the process is done, we inverse the final long bits

Because in this process you do not need to shift the whole long bits,
but only shift the much shorter bits every time, so the time
efficiency should be not bad.

However, there is one issue I cannot not figure it out: How to inverse
the bits efficiently? do you have any idea?

Thank you very much:!:)
Best regards,
Li

>
> -tkc

Steven D'Aprano

unread,

Apr 29, 2009, 12:23:08 AM4/29/09

to

On Wed, 29 Apr 2009 00:36:56 +1000, Li Wang wrote:

> Although bin(99)[4] could be used to locate it, this transform cost too
> much memory (99 only needs 2Bytes, while string '1100011' needs 7Bytes).

This is Python, not C. Your figures are completely wrong.

>>> sys.getsizeof(99)
12
>>> sys.getsizeof('1100011')
31

Everything in Python is an object. If you're expecting C performance for
bit-twiddling operations, you aren't going to get it.

--
Steven

casevh

unread,

Apr 28, 2009, 10:52:00 PM4/28/09

to

On Apr 28, 5:39 pm, Li Wang <li.wan...@gmail.com> wrote:
> 2009/4/29 Tim Chase <python.l...@tim.thechases.com>:
>

Using GMPY (http://code.google.com/p/gmpy/) may offer a performance
improvement. When shifting multi-thousand bit numbers, GMPY is several
times faster than Python longs. GMPY also support functions to scan
for 0 or 1 bits.

Dave Angel

unread,

Apr 29, 2009, 10:35:20 PM4/29/09

to pythonlist

casevh wrote:
> On Apr 28, 5:39 pm, Li Wang <li.wan...@gmail.com> wrote:
>
>> 2009/4/29 Tim Chase <python.l...@tim.thechases.com>:
>>
>>
>>>> I want to concatenate two bits string together: say we have '1001' and
>>>> '111' which are represented in integer. I want to concatenate them to
>>>> '1001111' (also in integer form), my method is:
>>>> ('1001' << 3) | 111
>>>> which is very time consuming.
>>>>
>>> You omit some key details -- namely how do you know that "1001" is 4 bits
>>> and not "00001001" (8-bits)? If it's a string (as your current code shows),
>>> you can determine the length. However, if they are actually ints, your code
>>> should work fine & be O(1).
>>>
>> Actually, what I have is a list of integer numbers [3,55,99,44], and
>> by using Huffman coding or fixed length coding, I will know how the
>> bits-length for each number. When I try to concatenate them (say
>> 10,000 items in the list) all together, the speed is going down
>> quickly (because of the shifting operations of python long).
>>
>>
>>
>>
>>> This can be abstracted if you need:
>>>
>>> def combine_bits(int_a, int_b, bit_len_b):
>>> return (int_a << bit_len_b) | int_b
>>>

>>> a =x09
>>> b =x07

>>> print combine_bits(a, b, 3)
>>>
>>> However, if you're using gargantuan ints (as discussed before), it's a lot
>>> messier. You'd have to clarify the storage structure (a byte string? a
>>> python long?)
>>>
>> I am using a single python long to store all the items in the list
>> (say, 10,000 items), so the work does become messier...
>>
>
> Using GMPY (http://code.google.com/p/gmpy/) may offer a performance
> improvement. When shifting multi-thousand bit numbers, GMPY is several
> times faster than Python longs. GMPY also support functions to scan
> for 0 or 1 bits.
>
>
>>> -tkc
>>>
>>> PS: You may want to CC the mailing list so that others get a crack at
>>> answering your questions...I've been adding it back in, but you've been
>>> replying just to me.
>>>
>> Sorry, this is the first time I am using mail-list....and always
>> forgot "reply to all"
>>
>> Thank you very much:D
>>
>>
>>
>> --
>> Li
>> ------
>> Time is all we have
>> and you may find one day
>> you have less than you think
>>

Seems to me you'd do better with a byte array than a Python long. At least while you're doing all those shifts. That way you do the shifts in an int, then update a byte in the array whenever you have 8 bits accumulated.

You can then look up a bit in the array trivially. And if you really need a Python long, I predict the single conversion will be much faster than doing all those shifts on the long as you go.