BadCharException error running maffilter ExtractFeatures

Ksenia Lavrichenko

Mar 30, 2017, 3:00:52 PM
to MafFilter
Hi,

I just started using maffilter (I downloaded the executable file and the Bio++ suite executables and placed them in my PATH). I created an options file and tried running maffilter with it on the Ensembl Compara MAF alignment of 40 eutherian mammals (release 88). I started with human chromosome 7, the idea being to eventually extract the block of the alignment covering my gene of interest (in the human reference). I supplied the matching human annotation file for feature extraction, but the job ended with an error:

AlphabetException: BadCharException: .. LetterAlphabet::charToInt: Unknown state (DNA alphabet). 

Googling around showed one relevant thread that pointed towards Bpp-raa library. 

I am not a C++ programmer and this is the first time I am trying to use this tool, so I am at a loss as to how to go about it.

A screenshot is attached (my options file first, then the run and the error).

Thanks for any tips,
Ksenia 


MafFilter_BadCharacter.png

Julien Yann Dutheil

Mar 31, 2017, 3:04:44 AM
to MafFilter
Dear Ksenia,

This looks like an alignment format issue. MafFilter says that one of your sequences contains a dot "." character, which is not recognized. It could be because of a parsing issue (it tries to read as a sequence something which is not one), or because for some reason your alignment does contain dots (as a substitute for gaps, maybe?). If you send me a MAF file (as small as possible) that reproduces the error, I can take a look.

Best,

Julien.
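
To check whether a MAF file really contains dots inside its sequences, a small script along these lines may help. It is only a sketch: it assumes the standard MAF layout, in which sequence lines start with "s" and carry seven whitespace-separated fields with the aligned text last. The names find_dotted_sequences and scan_maf, as well as the example block, are made up for this illustration and are not part of MafFilter or Bio++.

```python
import gzip


def find_dotted_sequences(lines):
    """Return (line_number, source, dot_count) for each 's' line whose aligned text has a dot."""
    hits = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        # MAF sequence line: s <src> <start> <size> <strand> <srcSize> <text>
        # Only the text column is inspected; dots in the source name are normal.
        if len(fields) == 7 and fields[0] == "s" and "." in fields[6]:
            hits.append((number, fields[1], fields[6].count(".")))
    return hits


def scan_maf(path):
    """Open a plain or gzip-compressed MAF file and list its dotted sequences."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return find_dotted_sequences(handle)


# Tiny made-up block: only the second sequence contains dots.
example = [
    "a score=0",
    "s homo_sapiens.7 100 6 + 1000 ACGTAC",
    "s species_b.scaffold_1 0 6 + 50 AC..AC",
]
print(find_dotted_sequences(example))
```

On a real file, scan_maf("your_file.maf.gz") would return the same kind of list, which makes it easy to see which species carry the dots.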

Ksenia Lavrichenko

Apr 4, 2017, 9:32:59 AM
to MafFilter
Dear Julien,

Thank you for your answer! It seemed so to me as well; I had just assumed that the Ensembl data would be properly formatted.

The file I was using is from ensembl release-88 compara maf folder: 

In my example the file is *.chr_7_1.maf.gz.  (I also tried other chromosomes with the same effect).
I did not do anything to these files, so it is safe to download and try on one of these. 

For the annotation, I downloaded a corresponding gtf from the same release:

Thank you in advance!

Best regards,
Ksenia 

Julien Yann Dutheil

Apr 4, 2017, 9:45:51 AM
to MafFilter
Dear Ksenia,

The file indeed contains dots in some sequences. I have no idea where they come from or whether they have any particular meaning. I can add an option to treat them as gaps, if you would find that useful, unless they should be treated as N instead? It would be good to know why Compara generates them in order to handle them properly.

Best,

Julien.

Julien Yann Dutheil

Apr 13, 2017, 8:41:35 AM
to MafFilter
Hi,

I have implemented an option to allow for dot characters, treating them as gaps (input.dots = as_gap, while input.dots = error, the default, corresponds to the previous behavior of raising an error).
Adding support for "compressed" alignments, where dots indicate identity with a reference sequence, proved to involve more modification of the parser than I thought. I have therefore postponed it for now, as there does not seem to be a real need for it at the moment.

Hope this helps,

Julien.

PS: to get the dot_as_gap option you need to compile maffilter from the git repository. It will be included in the next release.

Ksenia Lavrichenko

Apr 20, 2017, 8:07:58 AM
to MafFilter
Dear Julien,

I am delighted to hear of an updated version. However, I failed to compile this version (as did the system administrator at our institute, who has root privileges).
The error reported seems to be related to the code itself, in particular the new changes:

user@server:/usr/local/src/BioPP/maffilter> make install
Scanning dependencies of target info
[  0%] Built target info
Scanning dependencies of target man
[  0%] Built target man
Scanning dependencies of target maffilter
[ 25%] Building CXX object MafFilter/CMakeFiles/maffilter.dir/MafFilter.cpp.o
/usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp: In function ‘int main(int, char**)’:
/usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:182:37: error: invalid conversion from ‘short int’ to ‘const char*’ [-fpermissive]
       string dotOption = MafParser::DOT_ERROR;
                                     ^
In file included from /usr/include/c++/4.8/string:52:0,
                 from /usr/include/c++/4.8/bits/locale_classes.h:40,
                 from /usr/include/c++/4.8/bits/ios_base.h:41,
                 from /usr/include/c++/4.8/ios:42,
                 from /usr/include/c++/4.8/ostream:38,
                 from /usr/include/c++/4.8/iostream:39,
                 from /usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:41:
/usr/include/c++/4.8/bits/basic_string.h:490:7: error:   initializing argument 1 of ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, const _Alloc&) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’ [-fpermissive]
       basic_string(const _CharT* __s, const _Alloc& __a = _Alloc());
       ^
/usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:187:63: error: no matching function for call to ‘bpp::MafParser::MafParser(boost::iostreams::filtering_istream*, bool, std::string&)’
       currentIterator = new MafParser(&stream, true, dotOption);
                                                               ^
/usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:187:63: note: candidates are:
In file included from /usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:72:0:
/usr/local/apps/biopp/include/Bpp/Seq/Io/Maf/MafParser.h:87:5: note: bpp::MafParser::MafParser(const bpp::MafParser&)
     MafParser(const MafParser& maf):
     ^
/usr/local/apps/biopp/include/Bpp/Seq/Io/Maf/MafParser.h:87:5: note:   candidate expects 1 argument, 3 provided
In file included from /usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:72:0:
/usr/local/apps/biopp/include/Bpp/Seq/Io/Maf/MafParser.h:82:5: note: bpp::MafParser::MafParser(std::istream*, bool, short int)
     MafParser(std::istream* stream, bool parseMask = false, short dotOption = DOT_ERROR) :
     ^
/usr/local/apps/biopp/include/Bpp/Seq/Io/Maf/MafParser.h:82:5: note:   no known conversion for argument 3 from ‘std::string {aka std::basic_string<char>}’ to ‘short int’
MafFilter/CMakeFiles/maffilter.dir/build.make:54: recipe for target 'MafFilter/CMakeFiles/maffilter.dir/MafFilter.cpp.o' failed
make[2]: *** [MafFilter/CMakeFiles/maffilter.dir/MafFilter.cpp.o] Error 1
CMakeFiles/Makefile2:334: recipe for target 'MafFilter/CMakeFiles/maffilter.dir/all' failed
make[1]: *** [MafFilter/CMakeFiles/maffilter.dir/all] Error 2
Makefile:137: recipe for target 'all' failed
make: *** [all] Error 2


Any tips? 

Thank you in advance!
Ksenia

Julien Yann Dutheil

Apr 20, 2017, 8:57:50 AM
to MafFilter
Hi,

My mistake... Clang did not complain about that but gcc does... This is now fixed (maffilter sources only, other libs unchanged).

I apologize for the inconvenience,

J.

Ksenia Lavrichenko

Apr 24, 2017, 8:17:30 AM
to MafFilter
Dear Julien,

The new version was successfully compiled on our system. However, the error persists. It could be that I misunderstand how to use the tool, so in addition to the screenshot of the output and error I attach my options file. The input file I am using is exactly the same as before: eutherian mammals from Ensembl.

I would like to point out that I made sure to have the latest version of the tool (it does say that the parameter input.dots is missing if it is not supplied, and the release date is the latest).

If you have any hints please let me know! 

Best regards,
Ksenia 




maffilter.png
options_Compara.40_eutherian_mammals_EPO_LOW_COVERAGE.chr7_1.txt

Julien Yann Dutheil

Apr 24, 2017, 3:15:58 PM
to MafFilter
Dear Ksenia,

I'm terribly sorry, the option is as_gaps and not as_gap :s
When set correctly, you should see the message:
Maf 'dotted' alignment input...........: ON

I'm fixing the documentation.

I have run your option file on the data you mentioned, and got the following error:
MafAlignmentParser::nextBlock. Sequence found (choloepus_hoffmanni.scaffold_429401) does not match specified size: 888, should be 941.

This suggests that the dots are indeed not gaps, since they count towards the sequence length! I have therefore added a new option "as_unresolved" to convert them to 'N'. But with this option I get another error somewhere else, so it is still very unclear to me how to handle these dots. Digging a bit more into the way these alignments were generated, it appears that they result from an "extended" EPO pipeline allowing for "fragmented" genomes. I suspect the dots have something to do with that, and that they correspond to some kind of missing data which could be either an N or a gap. I do not understand, though, how sequence lengths are computed there. What I could do is add an option telling MafFilter to ignore the sequence length information and to use the length of what is actually found in the file (the current behaviour is to compare the real length with the one provided, as a double check). Converting dots to gaps or N in combination with this option should work, provided you do not try to interpret the coordinates of sequences containing such dots, as they would probably be erroneous. What do you think?

J.
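
The length check behind that error can be sketched as follows. In MAF, the size field of an "s" line is the number of non-gap characters in the aligned text, so one can count the characters with and without the dots and see which count agrees with the declared size. The helper names and the example lines are made up for illustration; this is not MafFilter's actual code.

```python
def observed_sizes(text):
    """Count residues in an aligned MAF text: dots treated as gaps, then dots treated as residues."""
    dots_as_gaps = sum(1 for c in text if c not in "-.")
    dots_as_residues = sum(1 for c in text if c != "-")
    return dots_as_gaps, dots_as_residues


def explain_size(line):
    """Say which reading of the dots agrees with the size declared on an 's' line."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not a MAF sequence line")
    declared = int(fields[3])
    as_gaps, as_residues = observed_sizes(fields[6])
    if "." not in fields[6]:
        return "no dots, size ok" if declared == as_gaps else "size mismatch"
    if declared == as_residues:
        return "dots counted in size"
    if declared == as_gaps:
        return "dots not counted in size"
    return "size mismatch"


# Made-up line: 4 letters, 2 dots, 1 gap, declared size 6.
print(explain_size("s species_b.scaffold_1 0 6 + 50 AC..-AC"))
```

A line reported as "dots counted in size" fails a strict length check as soon as its dots are turned into gaps, which is the situation described above.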

Ksenia Lavrichenko

Apr 24, 2017, 4:09:41 PM
to MafFilter
Dear Julien,

Thank you so much for investigating this. You are definitely much better equipped with knowledge to tackle the issue. 

Now that you mention it, it makes sense: Ensembl speaks of the number of "blocks" in these maf files (synteny blocks, I assume?), and most likely they use dots to stitch the blocks together, which confuses MafFilter. I agree with you that it is not straightforward to interpret the "sequence length".

Normally, the reference species is used for coordinates (at least this is what I do :-) ). This will always be unambiguous, and we can just instruct the tool to fetch such coordinates, stepping over the deletions but not the Ns, which of course raises the question that you correctly pointed out: which of these two is represented by dots.

I think introducing Ns instead of dots when parsing these synteny blocks should work, if your guess is correct and the dots are indeed some omitted sequence that perhaps did not have enough similarity.

Best regards,
Ksenia 

PS: thank you so much for quick feedback!
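
The idea of reading the dots as N can also be tried outside MafFilter with a small preprocessing sketch like this one. It assumes standard seven-field MAF "s" lines, rewrites only the aligned text, and re-joins the fields with single spaces, so the column alignment of the original file is not preserved. The function name dots_to_unresolved and the example block are hypothetical, and whether N is the right reading of these dots is exactly the open question here.

```python
def dots_to_unresolved(line, replacement="N"):
    """Replace dots in the aligned text of a MAF 's' line; return other lines unchanged."""
    fields = line.split()
    if len(fields) == 7 and fields[0] == "s":
        # Only the last field is touched, so 'species.chromosome' names keep their dot.
        fields[6] = fields[6].replace(".", replacement)
        return " ".join(fields)
    return line.rstrip("\n")


# Made-up block for illustration.
block = [
    "a score=0",
    "s homo_sapiens.7 100 6 + 1000 ACGTAC",
    "s species_b.scaffold_1 0 6 + 50 AC..AC",
]
for converted in map(dots_to_unresolved, block):
    print(converted)
```

Passing replacement="-" gives the gap reading instead, although the declared sizes may then no longer match the text.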



Julien Yann Dutheil

Apr 25, 2017, 6:09:04 AM
to MafFilter
Dear Ksenia,

This is now implemented. Took me a bit more time as I found another bug which is now fixed too :) I'm afraid you have to update and compile everything again... as we are about to release the next version of Bio++, there are quite some changes on the development branch at the moment.
I also attach a modified option file, showing the new options in play (you will just need to update the file paths). Some remarks:
- reference species should not contain chr numbers: homo_sapiens, not homo_sapiens.7. If you want to extract one chromosome only, you can use the SelectChr filter prior to extraction.
- you need to specify an output filter in order to actually retrieve the data. I extracted genes to fasta files in the Fasta/ directory (which needs to be created before running maffilter). Other options are of course possible.
- note that some regions have duplicates, you might want to run the Subset filter as a check.

Hope this helps!

Cheers,

Julien.
options_Compara.40_eutherian_mammals_EPO_LOW_COVERAGE.chr7_1.txt
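
On the first remark above: MAF source names conventionally combine the species and the sequence name as species.chromosome, which is why homo_sapiens.7 names a chromosome rather than a species. A minimal sketch of that split, with a made-up helper name (split_source) and assuming the species part itself contains no dot:

```python
def split_source(src):
    """Split a MAF source name such as 'homo_sapiens.7' into (species, sequence)."""
    species, sep, sequence = src.partition(".")
    # Without a dot there is no sequence part; report it as None.
    return species, (sequence if sep else None)


for name in ["homo_sapiens.7", "choloepus_hoffmanni.scaffold_429401", "homo_sapiens"]:
    print(name, "->", split_source(name))
```

The species part is what a reference-species setting expects, while the sequence part is what a chromosome selection would act on.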

Ksenia Lavrichenko

Apr 26, 2017, 9:01:10 AM
to MafFilter
Dear Julien,

Thank you for correcting the options file and explaining the workflow - this is very useful. 

However, our system administrator reported the following errors (it is always a work in progress):

Scanning dependencies of target info
[  0%] Built target info
Scanning dependencies of target man
[  0%] Built target man
Scanning dependencies of target maffilter
[ 25%] Building CXX object MafFilter/CMakeFiles/maffilter.dir/MafFilter.cpp.o
/usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp: In function ‘int main(int, char**)’:
/usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:189:21: error: ‘DOT_ASUNRES’ is not a member of ‘bpp::MafParser’
         dotOption = MafParser::DOT_ASUNRES;
                     ^
/usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:195:74: error: no matching function for call to ‘bpp::MafParser::MafParser(boost::iostreams::filtering_istream*, bool, bool&, short int&)’
       currentIterator = new MafParser(&stream, true, checkSize, dotOption);
                                                                          ^
/usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:195:74: note: candidates are:

In file included from /usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:72:0:
/usr/local/apps/biopp/include/Bpp/Seq/Io/Maf/MafParser.h:87:5: note: bpp::MafParser::MafParser(const bpp::MafParser&)
     MafParser(const MafParser& maf):
     ^
/usr/local/apps/biopp/include/Bpp/Seq/Io/Maf/MafParser.h:87:5: note:   candidate expects 1 argument, 4 provided

In file included from /usr/local/src/BioPP/maffilter/MafFilter/MafFilter.cpp:72:0:
/usr/local/apps/biopp/include/Bpp/Seq/Io/Maf/MafParser.h:82:5: note: bpp::MafParser::MafParser(std::istream*, bool, short int)
     MafParser(std::istream* stream, bool parseMask = false, short dotOption = DOT_ERROR) :
     ^
/usr/local/apps/biopp/include/Bpp/Seq/Io/Maf/MafParser.h:82:5: note:   candidate expects 3 arguments, 4 provided

Julien Yann Dutheil

Apr 26, 2017, 9:37:49 AM
to MafFilter
Dear Ksenia,

It looks like you have not updated the Bio++ libraries (in particular bpp-seq-omics) before compiling maffilter, isn't that so?

J.