Genome assembly mm39 and automatize for all HOCOMOCO TF

Brochut Maelick

Jun 19, 2024, 4:06:41 AM

to ask...@googlegroups.com

Hello PWMScan team,

I am a young bioinformatician from the Lausanne University Hospital (CHUV) in Switzerland and I want to use your tool PWMScan.

More precisely, I want to find all locations on the mouse genome (mm39) for every transcription factor in the HOCOMOCO V12 database.

I wanted to automate this task for all TFs in the HOCOMOCO database, but I didn't find any API for it. Is there another way to do this for the whole database? I also wanted to use the latest mouse genome assembly, mm39, but it does not appear in your list. I tried to upload the full genome (2.8G), but it is too big. Would it be possible to add mm39?

Best regards

Maëlick Brochut

CCG EPFL

Jun 27, 2024, 10:05:54 AM

to Ask EPD

Dear Maëlick

We plan to add mm39 to the PWMScan server but I can't promise you that this will happen soon. Also, we don't have an API currently.

However, you could use the PWMScan software locally under Linux or MacOS. Download the latest version from:

https://gitlab.sib.swiss/EPD/pwmscan

You will need two programs, matrix_scan and matrix_prob.

matrix_scan scans a sequence library with a single position weight matrix (PWM). The HOCOMOCO V12 CORE PWM library is available from:

https://epd.expasy.org/ftp/pwmlib/hocomocov12_core_matrix_logodds.mat

You will need to split this library file into individual files containing one matrix. The files will have the following format:

>log-odds matrix AHR.H12CORE.0.P.B: alength= 4 w= 10 n= 0 bayes= 0 E= 0
-51 29 64 -97
-42 53 -72 26
-130 -91 59 64
-109 -209 -146 155
-497 -190 175 -158
-639 198 -797 -427
-480 -697 198 -539
-697 -10000 -10000 200
-797 -10000 200 -10000
-76 158 -497 -137

(If you need help with the splitting, we can do it for you)
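The splitting step could be sketched in Python roughly as follows. This is only a minimal sketch: it assumes every matrix in the library starts with a header line of the form shown above (">log-odds matrix NAME: ..."), and the function names and the NAME.mat output naming are illustrative choices, not part of PWMScan.

```python
import re
from pathlib import Path

# Assumed header layout, taken from the example above:
# ">log-odds matrix AHR.H12CORE.0.P.B: alength= 4 w= 10 ..."
HEADER = re.compile(r">\s*log-odds matrix\s+([^:\s]+)\s*:")


def split_matrix_library(text):
    """Return {matrix name: file content} for a multi-matrix library."""
    matrices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between rows or matrices
        if line.startswith(">"):
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"unexpected header line: {line!r}")
            current = matrices.setdefault(match.group(1), [])
        if current is None:
            raise ValueError("matrix rows found before any header line")
        current.append(line)
    return {name: "\n".join(lines) + "\n" for name, lines in matrices.items()}


def write_matrices(library_path, out_dir):
    """Write one NAME.mat file per matrix (hypothetical naming scheme)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    matrices = split_matrix_library(Path(library_path).read_text())
    for name, body in matrices.items():
        (out / f"{name}.mat").write_text(body)
    return sorted(matrices)
```

For example, write_matrices("hocomocov12_core_matrix_logodds.mat", "matrices") would be expected to create matrices/AHR.H12CORE.0.P.B.mat and one file for each other matrix in the library.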

The program matrix_scan requires a cut-off value in raw score units. You can determine the raw score corresponding to a P-value of 0.00001 using matrix_prob:

matrix_prob -e 0.00001 -b 0.29,0.21,0.21,0.29 AHR.H12CORE.0.P.B.mat

(returns: SCORE : 1305)
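To automate this step for many matrices, the cut-off could be read from the matrix_prob output in Python. The sketch below assumes the output contains a line like "SCORE : 1305" as shown above; whether it goes to stdout or stderr is not stated here, so both are searched. The function names are illustrative only.

```python
import re
import subprocess


def parse_cutoff(output):
    """Extract the integer raw score from text containing 'SCORE : <n>'."""
    match = re.search(r"SCORE\s*:\s*(-?\d+)", output)
    if match is None:
        raise ValueError("no 'SCORE : <n>' line found in matrix_prob output")
    return int(match.group(1))


def matrix_prob_cutoff(matrix_file, pvalue="0.00001",
                       background="0.29,0.21,0.21,0.29"):
    """Run matrix_prob with the -e and -b options from the example above."""
    result = subprocess.run(
        ["matrix_prob", "-e", pvalue, "-b", background, str(matrix_file)],
        capture_output=True, text=True, check=True,
    )
    # Assumption: the score line may be on either stream, so check both.
    return parse_cutoff(result.stdout + result.stderr)
```

With matrix_prob on the PATH, matrix_prob_cutoff("AHR.H12CORE.0.P.B.mat") should then give the same value as the command above.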

You can then run matrix_scan as follows:

matrix_scan -m AHR.H12CORE.0.P.B.mat -c 1305 < mm39.fna > ..

This took 30 seconds on my MacBook Pro.
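Since there is no API, a whole-library run could be scripted locally along these lines. This is a sketch under assumptions: the cut-offs have already been determined with matrix_prob as described above, the genome is an uncompressed FASTA file fed on standard input as in the example, and the output file naming (NAME.hits.txt) is hypothetical.

```python
import subprocess
from pathlib import Path


def build_scan_command(matrix_file, cutoff):
    """Return the matrix_scan argument list used in the example above."""
    return ["matrix_scan", "-m", str(matrix_file), "-c", str(cutoff)]


def output_path(matrix_file, out_dir):
    """Hypothetical naming scheme: <matrix name>.hits.txt inside out_dir."""
    return Path(out_dir) / (Path(matrix_file).stem + ".hits.txt")


def scan_all(cutoffs, genome_fasta, out_dir):
    """Scan one FASTA file with every matrix listed in `cutoffs`.

    `cutoffs` maps each matrix file to its raw-score cut-off.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for matrix_file, cutoff in cutoffs.items():
        target = output_path(matrix_file, out_dir)
        # The genome is passed on stdin and hits are captured from stdout,
        # mirroring "matrix_scan ... < mm39.fna > output".
        with open(genome_fasta, "rb") as genome, open(target, "wb") as out:
            subprocess.run(build_scan_command(matrix_file, cutoff),
                           stdin=genome, stdout=out, check=True)
```

A call such as scan_all({"AHR.H12CORE.0.P.B.mat": 1305}, "mm39.fna", "hits") would run the single scan shown above; adding more entries to the dictionary extends it to the other matrices.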

I hope this was helpful. Don't hesitate to contact me again if you run into problems or if you need clarification on any point.

Good luck,

Philipp

Brochut Maelick

Jun 28, 2024, 5:15:29 AM

to ask...@googlegroups.com

Hello,

As I haven't heard back from you, I wanted to follow up on my previous email regarding the possibility of updating the mouse genome assembly to mm39 and how to automate the process for multiple requests.

Best regards,

Maëlick Brochut

From: Brochut Maelick
Sent: Wednesday, 19 June 2024 10:06:35
To: ask...@googlegroups.com
Subject: Genome assembly mm39 and automatize for all HOCOMOCO TF

Maëlick Brochut

Jun 28, 2024, 7:31:18 AM

to Ask EPD

Thanks a lot Philipp!

I think it is very clear, and I will clone the repo and try to do it myself on my Linux server.

I just have two questions about the genome file (mm39.fna) used in your example: matrix_scan -m AHR.H12CORE.0.P.B.mat -c 1305 < mm39.fna > ..

Was it the whole genome or a filtered one? And can I use the FASTA file that I used for the alignment in my ATAC-seq experiment?

The alignment was done with the Ensembl genome assembly: https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.toplevel.fa.gz

Cheers

Maëlick

CCG EPFL

Jun 30, 2024, 5:51:40 AM

to Ask EPD

https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.toplevel.fa.gz

This file should work. I get the same result as with my FASTA file downloaded from NCBI. Note, however, that matrix_scan automatically translates lowercase letters into uppercase on input. For a repeat-masked genome, where repeat regions are presented in lowercase, matrix_scan will thus consider these regions, but you will not notice it because the motif matches are displayed in uppercase on output. If you would like to exclude masked regions from the search, you would have to replace the lowercase letters in the genome file with N's or n's.
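Replacing the lowercase letters could be done with a small Python script such as the sketch below. It assumes a standard FASTA layout in which header lines start with ">" and must be left untouched; the function names are illustrative, and files ending in .gz are read or written through gzip.

```python
import gzip
import string

# Map every lowercase ASCII letter to "N"; uppercase letters are unchanged.
_MASK = str.maketrans(string.ascii_lowercase, "N" * len(string.ascii_lowercase))


def mask_line(line):
    """Replace soft-masked (lowercase) bases with N, keeping header lines."""
    if line.startswith(">"):
        return line
    return line.translate(_MASK)


def open_text(path, mode):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def mask_fasta(src, dst):
    """Stream src to dst line by line, hard-masking lowercase sequence."""
    with open_text(src, "r") as fin, open_text(dst, "w") as fout:
        for line in fin:
            fout.write(mask_line(line))
```

For instance, mask_fasta("Mus_musculus.GRCm39.dna_sm.toplevel.fa.gz", "mm39.masked.fa") would write an uncompressed hard-masked copy (the output name is just an example) that can then be passed to matrix_scan on standard input.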

Philipp

Maëlick Brochut

Jul 2, 2024, 9:38:55 AM

to Ask EPD

Everything works perfectly with your instructions, and it runs really fast!

Thanks a lot for this great tool and for being responsive to my questions.

Have a nice day.

Maëlick
