Re: Comment on Introduction in biopieces


Martin Asser Hansen

Jun 25, 2014, 4:25:57 AM
to congma...@tuebingen.mpg.de, biop...@googlegroups.com
Hi Congmao (CC Biopieces Google group),

The ordering of elements in a Perl hash is not stable (this is not a problem for the Biopieces written in Ruby, where hash order is stable). So the hash order may differ between platforms and/or Perl versions. I can see how this causes trouble with the test suite, but I guess we have been lucky so far, since no one has reported this problem before.

It is possible to make Perl hash order stable using Tie::IxHash, but I suspect that it will have some performance overhead. Also, it is a bit tricky to implement across all relevant Biopieces if we wanted a setting to use stable hashes for the test suite. 

The simple solution would be to sort the keys in the records in the test suite, but then a fair amount of the expected output records also need sorting. Hm, I can probably cook up a diff_sort that will do it automatically. I'll think some more about it.
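
A minimal sketch of such an order-insensitive comparison, written in Python purely for illustration (the actual test suite is in Bash, and diff_sort itself is not shown in this thread). It assumes records are "KEY: value" lines terminated by a "---" line; parse_records and records_equal are hypothetical names, and the HIT_BEG/HIT_END values are made up:

```python
def parse_records(text):
    """Split a record stream into records, each a list of 'KEY: value' lines."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == "---":      # record separator
            records.append(current)
            current = []
        elif line.strip():             # skip blank lines
            current.append(line.rstrip())
    if current:                        # tolerate a missing trailing separator
        records.append(current)
    return records


def records_equal(result, expected):
    """Compare two streams record by record, ignoring key order within a record."""
    def normalize(text):
        return [sorted(record) for record in parse_records(text)]
    return normalize(result) == normalize(expected)


# Same record, keys emitted in a different order.
result = "HIT_END: 20\nHIT_BEG: 10\n---\n"
expected = "HIT_BEG: 10\nHIT_END: 20\n---\n"
print(records_equal(result, expected))  # True
```

Only the lines inside each record are sorted; the order of the records themselves is still compared, so a tool that emits records in the wrong sequence would still fail.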



Cheers,


Martin


On Wed, Jun 25, 2014 at 2:59 AM, <biop...@googlecode.com> wrote:
Comment by congmao....@tuebingen.mpg.de:

Hi Martin,

I found a very strange problem when running "bp_test". Under Ubuntu 12.04 it works perfectly and always passes :)
But under Mint 13 the test results are correct, yet the records are not always in the same order (for example, add_ident), just as you said on the wiki: "Since the records basically are hash structures this mean that the order of the keys in the stream is unordered, and in the above example it is pure coincidence that HIT_BEG is displayed before HIT_END."
So the test scripts need to be improved: they should sort both the actual output records and the expected output records before running diff, right?
I don't know why it always generates the records in the same order under Ubuntu but not under Mint. Do you have any ideas?

best,
Congmao

For more information:
https://code.google.com/p/biopieces/wiki/Introduction

Martin Asser Hansen

Jun 25, 2014, 5:14:27 AM
to congma...@tuebingen.mpg.de, biop...@googlegroups.com
OK, I added sorting of records in the test suite. I got all green tests on my Mac and our Linux server.

bp_update && bp_test


Martin

Congmao

Jun 25, 2014, 5:19:05 AM
to Martin Asser Hansen, biop...@googlegroups.com
Great! I will test on Mint to see the results.
By the way, Martin, does Biopieces support development in Bash scripts? I find it supports Perl, Python and Ruby very well.

best,
Congmao 

Martin Asser Hansen

Jun 25, 2014, 5:24:44 AM
to biop...@googlegroups.com
No bash support in Biopieces (though the testing suite is in bash). Bash does not handle the concept of records very well unless these are a single line. I guess with Bash and AWK something could be made to work, but I am not gonna try.

Cheers,


Martin



congmao wang

Jun 25, 2014, 6:15:35 AM
to biop...@googlegroups.com, ma...@maasha.dk
Hmm, yes, Bash itself does not support multiline records well.
Currently, uniq_seq does not produce a stable result, especially with option -c (the result itself is still correct). I think this is also caused by the hash order.

Inside the bp_test:
uniq_seq -I $BP_DIR/bp_test/in/uniq_seq.in -c

You will sometimes get GCAT and sometimes ATGC. It would be better to change either the expected output or the test input so that the test passes.

best,
Congmao

Martin Asser Hansen

Jun 25, 2014, 6:22:56 AM
to biop...@googlegroups.com
So the Biopieces output will still be using unordered records, but the assertion where we compare the result and the expected result now sorts the lines in each record.



Martin

congmao wang

Jun 25, 2014, 6:53:33 AM
to biop...@googlegroups.com, ma...@maasha.dk
It now seems this is not a problem with your sorting script.
Your input test case for uniq_seq is quite extreme (the sequences are reverse complements of each other):
SEQ_NAME: test1
SEQ: ATGC
SEQ_LEN: 4
---
SEQ_NAME: test2
SEQ: ATGC
SEQ_LEN: 4
---
SEQ_NAME: test3
SEQ: GCAT
SEQ_LEN: 4
---

Both GCAT and ATGC are valid results when you run 'uniq_seq -I $BP_DIR/bp_test/in/uniq_seq.in -c', because internally the hash order is not stable. Your expected output, however, is always GCAT (which would only be right if the hash order were stable).
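
One way to make such a result independent of hash order is to reduce each sequence to a canonical form before deduplicating. The sketch below is in Python purely for illustration and is not how uniq_seq is implemented. It assumes that -c treats a sequence and its reverse complement as the same sequence; reverse_complement, canonical and uniq_canonical are hypothetical names:

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Pick the lexicographically smaller of seq and its reverse complement."""
    return min(seq, reverse_complement(seq))


def uniq_canonical(seqs):
    """Deduplicate sequences, reporting each one in its canonical form."""
    seen = set()
    unique = []
    for seq in seqs:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# The three test sequences, in two different input orders.
print(uniq_canonical(["ATGC", "ATGC", "GCAT"]))  # ['ATGC']
print(uniq_canonical(["GCAT", "ATGC", "ATGC"]))  # ['ATGC']
```

With this rule the reported sequence is always ATGC regardless of iteration order, so the expected output (currently GCAT) would have to be updated to match.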

Congmao

congmao wang

Jun 25, 2014, 6:59:46 AM
to biop...@googlegroups.com, ma...@maasha.dk
For testing purposes the sorting assertion is useful, but it would still be better if each Biopieces utility output its records in a stable way :)

Martin Asser Hansen

Jun 25, 2014, 7:14:13 AM
to biop...@googlegroups.com
Well that is a most annoying case :o). I agree that it would be neat with stable records, however, I still think introducing the Tie::IxHash module is overkill. I'd rather port that script to Ruby which has stable hash order. Any other Biopieces causing trouble?


Cheers,


Martin

congmao wang

Jun 25, 2014, 7:44:43 AM
to biop...@googlegroups.com, ma...@maasha.dk
No similar trouble so far :o). It seems the test scripts do not yet cover all utilities; I will report back if I find more.

cheers,
Congmao