Using psi.tsv for machine learning


Onkar Mulay

Oct 23, 2021, 11:37:51 AM
to majiq_voila
Hi,

Once we have the psi.tsv file from MAJIQ, how do we transform it into a matrix that can be fed into a machine learning pipeline, e.g. with rows as samples and columns as the PSI values of junctions? Any tips or links would be appreciated.


Regards,
Onkar

jai...@biociphers.org

Oct 25, 2021, 9:23:27 AM
to majiq_voila
Dear Onkar,

  • psi.tsv has one row per LSV, so the per-junction quantifications are given as semicolon-delimited lists. You will probably want to "explode" those columns into one row per junction. In pandas, you can do this on a column with .str.split(";") followed by .explode(); in tidyr, the equivalent is tidyr::separate_rows() (see the docs for the appropriate arguments in both cases).
  • Since the quantified LSVs will differ between groups (and appear in different orders), you will want to keep track of LSV/junction identity and "join" the tables accordingly.
  • Note that a single junction can have up to 2 quantifications, because it can belong to both a source LSV and a target LSV. More generally, splicing quantifications within a gene may be explicitly or implicitly correlated. You will have to decide which changes you want to learn and which to ignore, and how to do so in your pipeline.
I think that covers most of the MAJIQ-specific aspects of the fairly broad question of how to use splicing quantifications for "machine learning". Beyond that, we would need more information about the biology- or ML-specific questions you are trying to answer, which might be outside the scope of this mailing list.
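
As a rough illustration of the first two points, here is a minimal pandas sketch, not a definitive recipe. It assumes one psi.tsv per sample (or group), and that the columns are named lsv_id and mean_psi_per_lsv_junction; check the header of your own files, since those names are assumptions here. The helper names explode_psi and psi_matrix are made up for this example, and the toy values are invented. Junctions are identified by their position within the LSV, which assumes all files come from the same MAJIQ build so that the order is consistent.

```python
import io
from typing import Dict

import pandas as pd

# Assumed column names; adjust to match the header of your psi.tsv.
LSV_COL = "lsv_id"
PSI_COL = "mean_psi_per_lsv_junction"


def explode_psi(table: pd.DataFrame, sample: str) -> pd.DataFrame:
    """Turn one row per LSV into one row per (LSV, junction)."""
    long = table[[LSV_COL, PSI_COL]].copy()
    # "0.8;0.2" -> ["0.8", "0.2"], then one row per list element.
    long[PSI_COL] = long[PSI_COL].astype(str).str.split(";")
    long = long.explode(PSI_COL, ignore_index=True)
    # Position of the junction within its LSV (0, 1, 2, ...).
    long["junction_idx"] = long.groupby(LSV_COL).cumcount()
    long["feature"] = long[LSV_COL] + "|" + long["junction_idx"].astype(str)
    long["psi"] = long[PSI_COL].astype(float)
    long["sample"] = sample
    return long[["sample", "feature", "psi"]]


def psi_matrix(tables: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Samples as rows, LSV junctions as columns; NaN where not quantified."""
    long = pd.concat(
        [explode_psi(t, name) for name, t in tables.items()],
        ignore_index=True,
    )
    # pivot aligns on the feature key, so row order in each file is irrelevant.
    return long.pivot(index="sample", columns="feature", values="psi")


# Toy stand-ins for two psi.tsv files (invented values).
toy_a = pd.read_csv(io.StringIO(
    "lsv_id\tmean_psi_per_lsv_junction\n"
    "geneA:s:100-200\t0.8;0.2\n"
    "geneB:t:500-600\t0.1;0.6;0.3\n"
), sep="\t", dtype=str)
toy_b = pd.read_csv(io.StringIO(
    "lsv_id\tmean_psi_per_lsv_junction\n"
    "geneA:s:100-200\t0.5;0.5\n"
), sep="\t", dtype=str)

matrix = psi_matrix({"sample_a": toy_a, "sample_b": toy_b})
print(matrix)
```

LSVs that were not quantified in a given sample show up as NaN, so you still have to decide whether to drop or impute those columns. This sketch also does nothing about the third point: a junction quantified in both a source and a target LSV will appear as two separate columns.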

Best,
Joseph
