I am attempting to run importRdata() but I keep getting an error just for this specific dataset. I have analyzed other datasets from the same source (ENCODE) and have not had any issues. The only difference in terms of data structure that I'm aware of is that other datasets have 2 runs each for the WT and KD conditions, while this dataset has 2 WT and 4KD runs (as seen in the design dataframe below). importIsoformExpression() ran normally.
Here is the code I'm trying to run:
myDesign = data.frame(sampleID = c("SRR14837933","SRR14837934","SRR14844613","SRR14844614","SRR14844615","SRR14844616"),
condition = c("WT","WT","KD","KD","KD","KD"))
SwitchList = importRdata(isoformCountMatrix = PRPF4quant$counts,
isoformRepExpression = PRPF4quant$abundance,
designMatrix = myDesign,
isoformExonAnnoation = "C:/Users/Caleb/BioinfoData/IsoformSwitch/ENCODE_PRPF4_ISAR/st_merged.gtf",
isoformNtFasta = "C:/Users/Caleb/BioinfoData/IsoformSwitch/ENCODE_PRPF4_ISAR/PRPF4_transcripts.fa",
showProgress = TRUE)
And here is the output of importRdata():
Step 1 of 10: Checking data...
Step 2 of 10: Obtaining annotation...
importing GTF (this may take a while)...
148689 ( 50.76%) isoforms were removed since they were not expressed in any samples.
Step 3 of 10: Fixing StringTie gene annoation problems...
11238 isoforms were assigned the ref_gene_id and gene_name of their associated gene_id.
This was only done when the parent gene_id were associated with a single ref_gene_id/gene_name.
3896 isoforms were assigned the ref_gene_id and gene_name of the most similar
annotated isoform (defined via overlap in genomic exon coordinates).
This was only done if the overlap met the requriements
indicated by the three fixStringTieViaOverlap* arguments.
We were unable to assign 458 isoforms (located within annotated genes) to a known ref_gene_id/gene_name.
These were removed to enable analysis of the rest of the isoform from within the merged genes.
2191 gene_ids which were associated with multiple ref_gene_id/gene_names
were split into mutliple genes via their ref_gene_id/gene_names.
32074 genes_id were assigned their original gene_id instead of the StringTie gene_id.
This was only done when it could be done unambiguous.
Step 4 of 10: Calculating expression estimates from count data...
Skipped as user supplied expression via the "isoformRepExpression" argument...
Step 5 of 10: Testing for unwanted effects...
Error in density.default(x, adjust = adj) : 'x' contains missing values
In addition: Warning messages:
1: In importRdata(isoformCountMatrix = PRPF4quant$counts, isoformRepExpression = PRPF4quant$abundance, :
We found 8976 (2.97%) unstranded transcripts.
These were removed as unstranded transcripts cannot be analysed
2: In importRdata(isoformCountMatrix = PRPF4quant$counts, isoformRepExpression = PRPF4quant$abundance, :
No CDS annotation was found in the GTF files meaning ORFs could not be annotated.
(But ORFs can still be predicted with the analyzeORF() function)