PiGX ChIP-seq question

29 views
Skip to first unread message

Pia.Raut...@mdc-berlin.de

unread,
Aug 6, 2020, 8:59:34 AM8/6/20
to pi...@googlegroups.com
Dear maintainers of the PiGX ChIP-seq pipeline,

I am a Ph.D. student from AG Ohler at BIMSB and currently trying to run the PiGX ChIP-seq pipeline. I get the following error (one out of many, all with the same problem):

```
/gnu/store/n8awazyldv9hbzb7pjcw76hiifmvrpvd-coreutils-8.32/bin/mv: cannot stat '/fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/MyoD_control/MyoD_control_trimmed.fq.gz': No such file or directory
[Thu Aug 6 13:42:35 2020]
Error in rule trim_galore_se:
    jobid: 96
    output: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/MyoD_control/MyoD_control_R.fastq.gz
    log: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Log/MyoD_control/trim_galore_MyoD_control.log (check log file(s) for error message)
    shell:
        /gnu/store/fd4pwx2rh9389amqmb5jva3xin9g40hi-trim-galore-0.6.1/bin/trim_galore -o /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/MyoD_control /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/in/MyoD_control.fastq >> /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Log/MyoD_control/trim_galore_MyoD_control.log 2>&1 ; /gnu/store/n8awazyldv9hbzb7pjcw76hiifmvrpvd-coreutils-8.32/bin/mv /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/MyoD_control/MyoD_control_trimmed.fq.gz /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/MyoD_control/MyoD_control_R.fastq.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
```

I checked whether this (these) files indeed don't exist, and it seems that the output of the trimming that the pipeline generates is not gzipped.

I see files
```.../out/Trimmed/Trim_Galore/sampleX/sampleXl_R.fastq```
instead of the required
```.../out/Trimmed/Trim_Galore/sampleX/sampleXl_R.fastq.gz```

I downloaded pigx-chipseq via guix.

Please let me know if you need more information.

Thanks for your help,
Pia 

Alexand...@mdc-berlin.de

unread,
Aug 7, 2020, 8:47:47 AM8/7/20
to Pia.Raut...@mdc-berlin.de, Alexand...@mdc-berlin.de, pi...@googlegroups.com
Hi Pia,

This error is actually quite cryptic, but thank you for investigating so much of it. 

I peaked into your genome folder and saw that the input fasta file is actually a link to another file, 
So I thought you could maybe try to change the path to the real file, 
as right now the filetype detected with magic is 'inode/symlink’, 
which is why you get the 'Unknow input genome file format’ error. 

I’ll try to figure out in future wether following the symlink is possible somehow, or at least throw an early error. 

I hope this helped, but let me know if there are more problems.

Best,
Alex




Alexander Blume (Gosdschan)

PhD Student
Berlin Institute for 
Medical Systems Biology (BIMSB)
at Max Delbrueck Center (MDC)
Hannoversche Str. 28
10115 Berlin, Germany











On 7. Aug 2020, at 13:06, Rautenstrauch, Pia <Pia.Raut...@mdc-berlin.de> wrote:

Hi Alex,

Thanks a lot for the fast reply! Manually gzipping the files and rerunning the pipeline works - the pipeline continues. 


Unfortunately, I ran into a further weird bug in the "linking the genome fasta" step. 

```
[Fri Aug  7 12:00:09 2020]
Job 99: 
            Linking genome fasta:
                input : /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/genome/mm9_primary_assembly.fa
                output: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Bowtie2_Index/Main/mm9_primary_assembly.fa
        

Job counts:
count jobs
1 link_genome
1
[Fri Aug  7 12:00:12 2020]
Error in rule link_genome:
    jobid: 0
    output: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Bowtie2_Index/Main/mm9_primary_assembly.fa
    log: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Log/link_Main_genome_mm9_primary_assembly.log (check log file(s) for error message)

SystemExit in line 45 of /gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/Rules/Mapping.py:
Unknow input genome file format
  File "/gnu/store/09a5iq080g9b641jyl363dr5jkkvnhcn-python-3.8.2/lib/python3.8/concurrent/futures/thread.py", line 57, in run
  File "/gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/Rules/Mapping.py", line 67, in __rule_link_genome
  File "/gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/Rules/Mapping.py", line 45, in link_genome
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/.snakemake/log/2020-08-07T120005.091283.snakemake.log
```


I think I pinpointed the error after checking the code of pigx_chipseq/Rules/Mapping.py.

The problem is, that the script checks the filetype with magic, and my fasta files are identified as 'ASCII text’, however, only 'text/plain is accepted. 

Before I looked into the source code, I thought a problem might be, that I initially tried with a FASTA file, where all genome information per header was just in a single line, I subsequently split this. For the very long FASTA lines magic identifies the file type 'ASCII text, with very long lines. However, both does not work ;).

I checked also the original genome file that I downloaded from UCSC (mouse mm9) and magic also says it is 'ASCII text’, it is a quite old assembly, so I don’t know whether more recent ones are plain text instead of ASCII.

I tried to circumvent this problem by just gzipping my input genome, however, then, the function   check_fasta_header  (File "/gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/scripts/Check_Config.py", line 219, in check_fasta_header) says I would have whitespaces in the fasta headers, which I checked and there are none in the uncompressed fasta file.

```
SystemExit in line 26 of /gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/scripts/Check_Config.py:
ERROR: Settings file is not properly formated:
Genome fasta headers contain whitespaces.
 Please reformat the headers

  File "/gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/Snake_ChIPseq.py", line 44, in <module>
  File "/gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/scripts/Check_Config.py", line 26, in validate_config
```

I copied just this function and checked both my gzipped genome and my ungzipped genome fasta file.

For the ungzipped I get no error message, however, for the gzipped one I do. Hence, the pipeline also terminates.

I hope this information is sufficient to understand my problem. I am happy to provide more information.

For a quick fix, I am also happy to change my input file format/transform etc. 


Thanks a lot for your help,

Pia










On 6. Aug 2020, at 17:21, Blume, Alexander <Alexand...@mdc-berlin.de> wrote:

Hi Pia,

I am one of the developers responsible for the pigx-chipseq pipeline. 

Sorry that you are having problems with the pipeline, but thank you for sharing with us. 

I will have a closer look at your problem the next days, but it seems that the gzip argument is not forwarded to the trimming tool. 

A fix will hopefully be available by next week, until then I suggest you try to trick the pipeline by gzipping the trimmed fastq files yourself and testing wether it continuous. 

If you find any other bugs or weird things please let us know, we are very happy to receive feedback. 

Best, 
Alex 



--
You received this message because you are subscribed to the Google Groups "pigx" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pigx+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pigx/EAE5B5D5-D292-4465-A013-C08DBEF1483E%40mdc-berlin.de.



Alexand...@mdc-berlin.de

unread,
Aug 7, 2020, 10:12:47 AM8/7/20
to Pia.Raut...@mdc-berlin.de, Alexand...@mdc-berlin.de, pi...@googlegroups.com
Hi Pia,

The perl warnings you can safely ignore, but the error points towards some formatting issues in the settings that you defined for the mapping step.

Could you please send me your settings file for the pipeline? 

Best,
Alex
  


On 7. Aug 2020, at 15:59, Rautenstrauch, Pia <Pia.Raut...@mdc-berlin.de> wrote:

Hi Alex,

Thanks a lot for the fast and helpful reply! Changing the path to the actual file path instead of using a soft link solved the error.

I am sorry to bother you again, I hope there are not a lot more problems coming up. 

In the "Mapping with bowtie2” step I get the following error (for all parallel jobs that got started):

```
[31m    output: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Mapped/Bowtie/Main/Myf5_rep1/Myf5_rep1.bam
    log: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Log/Myf5_rep1/Myf5_rep2.bowtie2_Main.log (check log file(s) for error message)
RuleException:
TypeError in line 174 of /gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/scripts/SnakeFunctions.py:
'NoneType' object is not iterable
  File "/gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/Rules/Mapping.py", line 220, in __rule_bowtie2
  File "/gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/scripts/SnakeFunctions.py", line 174, in join_params
  File "/gnu/store/09a5iq080g9b641jyl363dr5jkkvnhcn-python-3.8.2/lib/python3.8/concurrent/futures/thread.py", line 57, in run
```
And some perl warnings:
```
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C”).
```

Thanks again and kind regards,
Pia

Alexand...@mdc-berlin.de

unread,
Aug 7, 2020, 11:58:27 AM8/7/20
to Pia.Raut...@mdc-berlin.de, pi...@googlegroups.com
Hi Pia,

Thanks for sharing the settings file and your concerns. 

So, the error at hand is occurring because in the settings file you commented out all settings nested in the bowtie2 paragraph, 
however unfortunately we are still trying to access its contents as long as the paragraph itself ( ‘bowtie2:’) is not commented out and this leads to the ‘NoneType’ TypeError. 
The easiest fix would be to leave the 'N: 0’ uncommented as this is the default anyway and should not change the results at all. 

Concerning the unclear settings:
-  ‘bam_filter’ applies filters to the aligned reads, where ‘mapq' filters reads based on their respective mapping quality (see here for details http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#mapping-quality-higher-more-unique) and deduplicate removes alignments that map to identical positions which should be done for ATAC seq analysis
-  ‘extract_signal’  adjusts how the figures in the chipqc report are calculated, so ‘expand_peak' says to resize the peaks from whatever size to what is set there and ‘number_of_bins’ sets how bins are used to aggregate the coverage signal over the gene features in the report

Especially the ‘extract_signal’ settings are quite complicated and the defaults are choosen based on what works fast and looks good enough, but the ‘bam_filter’ settings you should adjust based on your analysis and required stringency.

Puh, this was a lot :)

Hope this helps you and explains unclear stuff. 

Best,
Alex 

On 7. Aug 2020, at 16:23, Rautenstrauch, Pia <Pia.Raut...@mdc-berlin.de> wrote:

Hi Alex,

Thanks for taking the time! Please find attached the settings file. Now that I am writing to you anyhow - for me it became not clear from the tutorial and manual which programs the bam_filter and extract_signal belong to (and hence which manual to check for the respective parameters/flags). I did not care too much for now, as I just wanted to check the pipeline in general..

Kind regards,
Pia

<settings.yaml>

Ricardo Wurmus

unread,
Aug 7, 2020, 12:44:22 PM8/7/20
to Alexand...@mdc-berlin.de, Pia.Raut...@mdc-berlin.de, pi...@googlegroups.com

Alexand...@mdc-berlin.de <Alexand...@mdc-berlin.de> writes:

> And some perl warnings:
> ```
> perl: warning: Setting locale failed.
> perl: warning: Please check that your locale settings:
> LANGUAGE = (unset),
> LC_ALL = (unset),
> LANG = "en_US.UTF-8"
> are supported and installed on your system.
> perl: warning: Falling back to the standard locale ("C”).
> ```

These indicate a problem with locales. Make sure that GUIX_LOCPATH is
set to a valid location and that glibc-locales is installed. On the
MDC Max cluster this *should* be the case by default, but if you’re
running it elsewhere you may need to check this by yourself.

--
Ricardo Wurmus

System administrator
BIMSB - Scientific Bioinformatics Platform
Max Delbrueck Center for Molecular Medicine

email: ricardo...@mdc-berlin.de
tel: +49 30 9406 1796

Alexand...@mdc-berlin.de

unread,
Aug 11, 2020, 8:11:01 AM8/11/20
to Pia.Raut...@mdc-berlin.de, Alexand...@mdc-berlin.de, pi...@googlegroups.com
Hi Pia, 

I just checked your fast file on the cluster and it seems to be a zip archive. This compression is usually not compatible with bioinformatics software, so you will need to archive the fastq files with the ‘gzip’ command. 
You may run these two commands after each other the to achieve this: 

‘''
unzip /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/*/*fastq.gz
gzip /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/*/*fastq
‘''

Hope this solves the issue.

Best,
Alex
 

On 11. Aug 2020, at 13:48, Rautenstrauch, Pia <Pia.Raut...@mdc-berlin.de> wrote:

Hi Alex,

Thank you very much for the fast and helpful replies and the explanation of the settings section!

I was on Holiday yesterday and only continued with the PiGX pipeline today. I left the N:0 option of bowtie2 uncommented which fixed the previously reported problem. However, I stumbled into another error:

```
Error in rule bowtie2:
    jobid: 0
    output: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Mapped/Bowtie/Main/Myf5_rep2/Myf5_rep2.bam
    log: /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Log/Myf5_rep2/Myf5_rep2.bowtie2_Main.log (check log file(s) for error message)

RuleException:
CalledProcessError in line 224 of /gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/Rules/Mapping.py:
Command 'set -euo pipefail;  /gnu/store/7sicb2zgpqnav53gaiizmarjdn03ydp2-bowtie-2.3.4.3/bin/bowtie2 -p 2 -x /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Bowtie2_Index/Main/mm9 -U /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/Myf5_rep2/Myf5_rep2_R.fastq.gz -N 0 2> /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Log/Myf5_rep2/Myf5_rep2.bowtie2_Main.log | /gnu/store/jx7i0x6jzfy9vlwsv6hycdyvih2q16lm-samtools-1.9/bin/samtools view -bhS > /fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Mapped/Bowtie/Main/Myf5_rep2/Myf5_rep2.bam' returned non-zero exit status 1.
  File "/gnu/store/nxk52pcmvizq6vlb4xmg2x0c3s74xprb-pigx-chipseq-0.0.42/libexec/pigx_chipseq/Rules/Mapping.py", line 224, in __rule_bowtie2
  File "/gnu/store/09a5iq080g9b641jyl363dr5jkkvnhcn-python-3.8.2/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Exiting because a job execution failed. Look above for error message
```

The log file contains the following information:
```
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LANG = "en_GB.UTF-8"
  are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
Error: reads file does not look like a FASTQ file
terminate called after throwing an instance of 'int'
(ERR): bowtie2-align died with signal 6 (ABRT) (core dumped)
```
I checked the input FASTQ file (/fast/AG_Ohler/prauten/data/Schuelke_Dummy_ChiP-seq/pigx/mm9/out/Trimmed/Trim_Galore/Myf5_rep2/Myf5_rep2_R.fastq.gz ). To me, it looks fine. 

I think the following thread could relate to the problem: https://sourceforge.net/p/bowtie-bio/bugs/163/. However, according to the website, the problem should be resolved.

I would appreciate your help and am sorry for running into so many problems.

Pia.Raut...@mdc-berlin.de

unread,
Sep 16, 2021, 9:32:50 AM9/16/21
to pi...@googlegroups.com
Dear maintainers of the PiGx pipelines,

I am planning to use the PiGx ChiP-seq pipeline and have a question regarding the installation with guix. I don't quite understand what ```guix environment -l guix.scm``` does and the explanation: "Assuming you have Guix installed, the following command spawns a sub-shell in which all dependencies are available" means in practice. Is this a "one time" shell that is opened? Can I only use this in interactive mode? In case I can use it also from within submitted job scripts, how can I connect to this shell? Is it possible to also do this for a named guix environment? And how would the command look like?

Ideally I would create a named guix environment with only the PiGx ChiP-seq pipeline and all its dependencies installed.

Thanks a lot and kind regards,
Pia

Ricardo Wurmus

unread,
Sep 16, 2021, 9:58:16 AM9/16/21
to Pia.Raut...@mdc-berlin.de, pi...@googlegroups.com

Hi Pia,

> I am planning to use the PiGx ChiP-seq pipeline and have a
> question regarding the installation with guix. I don't quite
> understand what ```guix environment -l guix.scm``` does and the
> explanation: "Assuming you have Guix installed, the following
> command spawns a sub-shell in which all dependencies are
> available" means in practice.

This is only useful for development of the pipeline. This
sub-shell disappears when you exit it, and yes it is only used
interactively.

If you actually want to use the pipeline you can install it with
Guix:

guix install pigx-chipseq

This will install the package into your default profile at
~/.guix-profile and you will end up with an executable
~/.guix-profile/bin/pigx-chipseq.

> Ideally I would create a named guix environment with only the
> PiGx ChiP-seq pipeline and all its dependencies installed.

You can also install the pigx-chipseq package into a separate Guix
profile, e.g. with this command:

guix package --profile=/somewhere/else/.guix-profile
--install=pigx-chipseq

You needn’t install any dependencies anywhere. PiGx keeps exact
references to all dependencies under /gnu/store, so they don’t
need to be explicitly installed into the profile.

--
Ricardo

Pia.Raut...@mdc-berlin.de

unread,
Sep 16, 2021, 10:01:10 AM9/16/21
to Ricardo...@mdc-berlin.de, pi...@googlegroups.com
Hi Ricardo,

Thank you very much for the instructive explanation!

Reading the installation instructions, I got the impression I have to "manually" install the dependencies, hence the confusion.

Thanks again for the clarification!

Best,
Pia

Ricardo Wurmus

unread,
Sep 16, 2021, 10:27:38 AM9/16/21
to Rautenstrauch, Pia, pi...@googlegroups.com

Hi Pia,

> Reading the installation instructions, I got the impression I
> have to "manually" install the dependencies, hence the
> confusion.

Yeah, you are right. This is a bit confusing. The section starts
with

You can install this pipeline with all its dependencies using
GNU Guix:

guix install pigx-chipseq

but it then goes on to talk about *manual* compilation in the same
section. It may be a good idea to move all that talk about manual
compilation to a separate section.

Thanks for your feedback!

--
Ricardo

Pia.Raut...@mdc-berlin.de

unread,
Sep 19, 2021, 8:35:53 AM9/19/21
to pi...@googlegroups.com
Hi there,

I am trying to run the ChIP-seq PiGx pipeline and am having trouble with the reference genome and annotations.

I downloaded the most recent genome (fasta files) and annotation from NCBI  (https://www.ncbi.nlm.nih.gov/genome/51?genome_assembly_id=582967).

Files look like this:
```


zcat GCF_000001405.39_GRCh38.p13_genomic.fna.gz | head
>NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN



zcat GCF_000001405.39_GRCh38.p13_genomic.gff.gz | head 
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build GRCh38.p13
#!genome-build-accession NCBI_Assembly:GCF_000001405.39
#!annotation-date 05/14/2021
#!annotation-source NCBI Homo sapiens Updated Annotation Release 109.20210514
##sequence-region NC_000001.11 1 248956422
NC_000001.11    RefSeq  region  1       248956422       .       +       .       ID=NC_000001.11:1..248956422;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
```

The pipeline terminates with:

```
Commencing snakemake run submission locally
SystemExit in line 40 of /gnu/store/kzxackdy32cjghdy8d1508ggp2bzn4x3-pigx-chipseq-0.0.52/libexec/pigx_chipseq/scripts/Check_Config.py:
ERROR: Settings file is not properly formated:
Genome fasta headers contain whitespaces.
 Please reformat the headers

  File "/gnu/store/kzxackdy32cjghdy8d1508ggp2bzn4x3-pigx-chipseq-0.0.52/libexec/pigx_chipseq/Snake_ChIPseq.py", line 47, in <module>
  File "/gnu/store/kzxackdy32cjghdy8d1508ggp2bzn4x3-pigx-chipseq-0.0.52/libexec/pigx_chipseq/scripts/Check_Config.py", line 40, in validate_config
```

Unfortunately just removing whitespaces from the fasta headers doesn’t work, as then the match of names between the fasta files and the off file breaks. 

I was wondering whether there is a quick fix for this or whether you experienced similar problems before? I suprised to run into trouble with the most recent human genome version.

Help highly appreciated!

Thanks and best wishes,
Pia


P.s.: In case you want to look at my project, it lives here: ```/fast/AG_Ohler/prauten/projects/BeyondTheExome/202109_ChIP-seq_test/```

Alexander Blume

unread,
Sep 19, 2021, 8:40:21 AM9/19/21
to Pia.Raut...@mdc-berlin.de, pi...@googlegroups.com
Hi Pia, 

I'll refer to the answer that we have in the Troubleshooting FAQ (https://bioinformatics.mdc-berlin.de/pigx_docs/pigx-chip-seq.html#troubleshooting):

Our pipeline does not like if in the FASTA headers of the reference genome there is more information than the chromosome name, so please use this oneliner to format your genome-file:

cat refGenome.fa  | awk '{ print $1}’ >  refGenome_noWhiteSpace.fa


Best,
Alex


--
You received this message because you are subscribed to the Google Groups "pigx" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pigx+uns...@googlegroups.com.

Pia.Raut...@mdc-berlin.de

unread,
Sep 21, 2021, 2:58:06 PM9/21/21
to Alexander Blume, pi...@googlegroups.com
Hi Alex,

Thanks a lot for the quick response and help! I am sorry for not having read the FAQ.. I don’t know whether this is of interest to you, but I also had to remove the header from the corresponding GTF annotation file for the pipeline to run (might be worth adding to the FAQ answer)
```sed '/^#/d' annotation_file.gtf > annotation_file_no_header.gtf```

I successfully ran the PiGx ChIP-seq pipeline on my samples, but unfortunately several of the figures in the report are empty.

Namely:
- 4 Reads Mapping to Mitochondrial Chromosome
- 6.1 GC Content Distribution Per Sample
- 8.1 Frequency of reads in peaks

Do you have an idea what might have gone wrong or how to fix this?

Thanks and best,
Pia

Alexander Blume

unread,
Sep 21, 2021, 4:50:49 PM9/21/21
to Pia.Raut...@mdc-berlin.de, pi...@googlegroups.com
Hi Pia,

Great you figured out the gtf issue!

Concerning the problem with the report, I have one good, but two bad answers.

First the bad ones: Right now, I do not have an idea why the 'GC Content Distribution Per Sample’ or the 'Frequency of reads in peaks’ did not work. You may want to inspect the chipqc log files for the samples to check if there are some obvious errors.

But for the plot showing the reads mapping to mitochondrial chromosome I might have a solution: In the settings file we have a setting called ‘chrM_givenName’ in the ‘chipqc’ part of the ‘general’ params section. You may want to put 'NC_012920.1’ in there, since the genome assembly that you are working with does not use the standard chromosome names in the fasta and gtf file, like chrM or MtDNA, but has these special refsec ids. However to let the report pick this change up, you may need to remove or rename the 'Analysis/RDS/ChIPQC’ folder and the ‘Analysis/Summarized_Data_For_Report.RDS’ file and rerun the pipeline to trigger re-execution of the chipqc rule.

I hope this helps you for now, but we can also try to figure out what is wrong with the other plots.

Best,
Alex

> On 21. Sep 2021, at 20:48, Pia.Raut...@mdc-berlin.de wrote:
>
> Hi Alex,
>
> Thanks a lot for the quick response and help! I am sorry for not having read the FAQ.. I don’t know whether this is of interest to you, but I also had to remove the header from the corresponding GTF annotation file for the pipeline to run (might be worth adding to the FAQ answer)
> ```sed '/^#/d' annotation_file.gtf > annotation_file_no_header.gtf```
>
> I successfully ran the PiGx ChIP-seq pipeline on my samples, but unfortunately several of the figures in the report are empty.
>
> Namely:
> - 4 Reads Mapping to Mitochondrial Chromosome
> - 6.1 GC Content Distribution Per Sample
> - 8.1 Frequency of reads in peaks
>
>
> I attached the html report.
>
> Do you have an idea what might have gone wrong or how to fix this?
>
> Thanks and best,
> Pia
>
>
>
> > On 19. Sep 2021, at 14:40, Alexander Blume <alex....@gmail.com> wrote:
> >
> <20210920_PiGx_ChIP_Seq_Report.html>

Pia.Raut...@mdc-berlin.de

unread,
Sep 22, 2021, 3:26:48 AM9/22/21
to Alexander Blume, pi...@googlegroups.com
Hi Alex,

Thanks again for the very quick and helpful response!

Regarding your tip for the mitochondrial genome - I had a similar thought, but since I wrote to you to ask about the chromosome naming issue, I have switched to a genome file from “GRCh38_major_release_seqs_for_alignment_pipelines”, where mitochondrial genome is called “chrM”. So the parameter used in the settings file for my analysis is correct and the plot still empty.  

Maybe of interest to you: Last year (on a different sample I ran the PiGx ChIP-seq pipeline version 0.0.42, and in this reports all figures are displayed), now I am using version  0.0.52,


I found one thing that might point towards why "6.1 GC Content Distribution Per Sample” does not work:

From knit_report.log
```
There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: replacing previous import 'Biostrings::pattern' by 'grid::pattern' when loading 'genomation'
2: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
3: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
4: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
5: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
6: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
7: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
8: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
9: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
10: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
11: Transformation introduced infinite values in continuous y-axis
12: Removed 17148 rows containing non-finite values (stat_binhex).
13: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
14: Transformation introduced infinite values in continuous y-axis
15: Removed 17229 rows containing non-finite values (stat_binhex).
16: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
17: Transformation introduced infinite values in continuous y-axis
18: Removed 17177 rows containing non-finite values (stat_binhex).
19: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
20: Transformation introduced infinite values in continuous y-axis
21: Removed 17153 rows containing non-finite values (stat_binhex).
22: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
23: Transformation introduced infinite values in continuous y-axis
24: Removed 17184 rows containing non-finite values (stat_binhex).
25: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
26: Transformation introduced infinite values in continuous y-axis
27: Removed 17251 rows containing non-finite values (stat_binhex).
28: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
29: Transformation introduced infinite values in continuous y-axis
30: Removed 17228 rows containing non-finite values (stat_binhex).
31: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
32: Transformation introduced infinite values in continuous y-axis
33: Removed 17211 rows containing non-finite values (stat_binhex).
34: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
35: Transformation introduced infinite values in continuous y-axis
36: Removed 17243 rows containing non-finite values (stat_binhex).
37: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
38: Transformation introduced infinite values in continuous y-axis
39: Removed 17251 rows containing non-finite values (stat_binhex).
40: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
41: Transformation introduced infinite values in continuous y-axis
42: Removed 17206 rows containing non-finite values (stat_binhex).
43: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
44: Transformation introduced infinite values in continuous y-axis
45: Removed 17268 rows containing non-finite values (stat_binhex).
46: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
47: Transformation introduced infinite values in continuous y-axis
48: Removed 17220 rows containing non-finite values (stat_binhex).
49: Computation failed in `stat_binhex()`:
The `hexbin` package is required for `stat_binhex()`
50: Transformation introduced infinite values in continuous y-axis
```

In the session info at the end of the PiGx report there is also no package hexbin listed (in contrast to the reports I got from version 0.0.42).


I’d be very happy if you could point me to where else I could go looking for potential sources of the reports not compiling. Especially the "8.1 Frequency of reads in peaks” would be of interest.


Thanks a lot and kind regards,
Pia






zcat GCF_000001405.39_GRCh38.p13_genomic.fna.gz | head
NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN




<20210920_PiGx_ChIP_Seq_Report.html>

Alexander Blume

unread,
Sep 22, 2021, 3:45:10 AM9/22/21
to Pia.Raut...@mdc-berlin.de, pi...@googlegroups.com
Hi Pia, 

You might want to check the 'Analysis/Summarized_Data_For_Report.RDS’ file, as this is the one that contains all summarised data and gets loaded when compiling the report. 
Maybe together with the code  of the report (https://github.com/BIMSBbioinfo/pigx_chipseq/blob/main/scripts/Sample_Report.rmd.in) you could check the contents of the RDS file by loading it with ```lstats = readRDS('Analysis/Summarized_Data_For_Report.RDS’)``` and running the corresponding sections ‘Peak_Statistics_Freq_Reads_In_Peaks’ and ‘Reads_Mapping_to_Mitochondrial_Chromosome’.  Maybe with the data loaded, you stumble upon something obvious. 


Best
Alex


Reply all
Reply to author
Forward
0 new messages