Tidy format


Dominick Costanzo

Oct 17, 2025, 3:41:44 PM
to R/qtl discussion
Hi Karl,
I have a massive haploid yeast data set that is quite difficult to manipulate: ~42k markers and 99,950 individuals (most of this work is done on a computing cluster). The genotype data have individuals in rows and markers in columns, with A/B for genotypes. I have a separate linkage map with chromosome and cM positions for each marker that was not dropped (a few did get dropped), and a separate phenotype file with one column per phenotype and one row per individual. I saw there is a "tidy" format that would allow me to use read.cross and load each file without any combining, but I am having trouble getting this to work. I'm currently working on a small subset of the genotypes locally to try to get this to work; my current script and errors are below. My workflow plan is to run scanone, then use the top 5 peaks as covariates in CIM, and eventually run scantwo. Any help with this would be greatly appreciated!
Thanks,
Dominick

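library(qtl)

# Command-line arguments from SLURM; in an interactive session with no
# arguments, fall back to local test paths.
args <- commandArgs(trailingOnly = TRUE)
if (interactive() && length(args) == 0) {
  args <- c(
    "C:/Users/domco/Desktop/stepwiseqtl/fivegenotypes.csv",
    "C:/Users/domco/Desktop/stepwiseqtl/combined_link_map.csv",
    "C:/Users/domco/Desktop/stepwiseqtl/CIM",
    "cell_cycle_G1",
    "C:/Users/domco/Desktop/stepwiseqtl/combined_phenotypes.csv")
}

genotype   <- args[1]  # genotype file
map_file   <- args[2]  # linkage map file
output_dir <- args[3]  # output directory
pheno      <- args[4]  # name of the phenotype column to analyze
phenotype  <- args[5]  # phenotype file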


dir.create(output_dir, showWarnings=TRUE, recursive=TRUE)

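# End each message with a newline so the log lines do not run together
cat("Genotype file:", genotype, "\n")
cat("Linkage map:", map_file, "\n")
cat("Output directory:", output_dir, "\n")
cat("Phenotype:", pheno, "\n")
cat("Phenotype file:", phenotype, "\n")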

#read linkage map file
#map=fread(map_file)



#read the cross file(wide file)
cross <- read.cross(format="tidy",
                    header=TRUE,
                    mapfile=map_file,
                    genfile=genotype,
                    phefile=phenotype,
                    genotypes=c("A", "B"),
                    na.strings="NA",
                    estimate.map=FALSE)

cat("Cross successfully loaded\n")

Error in rep(NA, n.add * ncol(gen)) : invalid 'times' argument
In addition: Warning messages:
1: In dir.create(output_dir, showWarnings = TRUE, recursive = TRUE) :
  'C:\Users\domco\Desktop\stepwiseqtl\CIM' already exists
2: In read.cross.csvs(dir, genfile, phefile, na.strings, genotypes,  :
  Including "" in na.strings will cause problems; omitted.
3: In n.add * ncol(gen) : NAs produced by integer overflow

Karl Broman

Oct 17, 2025, 7:27:54 PM
to R/qtl discussion
It's hard to tell without seeing the data files, but it seems like it's trying to make a dataset that is larger than what can be stored. ... that n.add * ncol(gen) is larger than the biggest integer.
But 42000 markers * 99950 individuals shouldn't be that large.

karl
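
For reference, the arithmetic behind that overflow warning can be checked directly. R stores integers in 32 bits, so an integer product above `.Machine$integer.max` (2147483647) comes back as `NA` with a warning, and `rep()` then rejects that `NA` as an invalid `times` value. A minimal sketch, assuming roughly 99,950 individuals being added and 42,000 marker columns (the counts from the first message; the variable names are illustrative and not taken from the R/qtl source):

```r
n_add     <- 99950L  # assumed: individuals found only in the phenotype file
n_markers <- 42000L  # assumed: marker columns in the genotype file

# Integer arithmetic overflows to NA (the warning is suppressed here)
int_product <- suppressWarnings(n_add * n_markers)

# Double arithmetic keeps the actual value
dbl_product <- as.numeric(n_add) * n_markers

print(.Machine$integer.max)
print(int_product)
print(dbl_product)
print(dbl_product > .Machine$integer.max)
```

Under these assumed counts the product is about 4.2 billion, which is past the 32-bit limit even though each dimension on its own is modest.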

Dominick Costanzo

Oct 20, 2025, 10:25:59 AM
to R/qtl discussion
Hi Karl,
Thanks so much for the quick response! Here are the first few rows of my data files; does that help?

Genotype file
id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp10
0,A,A,A,A,A,A,A,A,A
1,A,A,A,A,A,A,A,A,A
2,A,A,A,A,A,A,A,A,A
3,A,A,A,A,A,A,A,A,A
4,B,B,B,B,B,B,B,B,B
5,A,A,A,A,A,A,A,A,A
6,A,A,A,A,A,A,A,A,A
7,A,A,A,A,A,A,A,A,A
8,B,B,B,B,B,B,B,B,B

Phenotype file
id,cell_cycle_G1,cell_cycle_G2,cell_cycle_M,cell_cycle_S,mating_efficiency_BYpartner,mating_efficiency_RMpartner,pheno_data_30C
0,,,,,0.0636429,,
1,,,,,,,-0.1020694
2,,,,,,,-0.0795859
3,,,,,,,-0.0695231
4,,,,,,,-0.0598173
5,,,,,,,-0.1414214
6,,,,,,,-0.0479325
7,,,,,,,-0.0971627
8,,,,,,,-0.1014535

Linkage map
chromosome,marker,cM
1,snp5,0
1,snp6,0.0000001
1,snp7,0.2560438
1,snp8,0.2560439
1,snp9,0.2600487
1,snp10,0.2610462
1,snp11,0.2610463
1,snp12,0.2630433

Karl Broman

Oct 20, 2025, 4:10:00 PM
to R/qtl discussion
That doesn't much help. I would want the actual files in order to try to reproduce the problem.

Also, looking back at your error, "rep(NA, n.add * ncol(gen))" is something that would happen only in read.cross with format="csvs" or "csvsr".
And the warning message mentions read.cross.csvs().

But neither of these should happen with read.cross with format="tidy". 

So I'm puzzled.

karl

Martin Ferris

Oct 20, 2025, 4:29:24 PM
to rqtl...@googlegroups.com
I had a similar issue years ago, but can't remember the details exactly. I think I solved it by doing a more manageable read-in (e.g., 10 samples and 1,000 markers). I found it was a slightly odd formatting issue where qtl didn't recognize the sample names as matching between the two sheets, so it created millions of extra 'na' cells, enough to fill the memory I had at the time.

Regards,
Marty
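
A small read-in along those lines could look like the sketch below, in base R. The file here is a hypothetical stand-in written to a temporary path so the example runs anywhere; with real data, `geno_path` would point at the genotype CSV instead. Reading every column as text is an assumption made so that the IDs are shown exactly as stored.

```r
# Write a small stand-in genotype file (hypothetical data)
geno_path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 0:49,
                   matrix("A", nrow = 50, ncol = 20,
                          dimnames = list(NULL, paste0("snp", 1:20))))
write.csv(demo, geno_path, row.names = FALSE, quote = FALSE)

# Read only the first 10 individuals, keeping all columns as character
corner <- read.csv(geno_path, nrows = 10, colClasses = "character")

# Keep the ID column plus the first 5 markers
corner <- corner[, 1:6]

# Bracket the IDs so that stray leading or trailing spaces are visible
print(sprintf("[%s]", corner$id))
print(dim(corner))
```

Printing the IDs inside brackets makes padding such as `[    0]` stand out, which is hard to spot when the values are printed bare.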

--
You received this message because you are subscribed to the Google Groups "R/qtl discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rqtl-disc+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/rqtl-disc/47eca4e7-b9b9-463e-8712-5ba97398e496n%40googlegroups.com.

Dominick Costanzo

Oct 24, 2025, 3:50:34 PM
to R/qtl discussion
I couldn't get the tidy format to take the files at all, so I resorted to switching to csvs. However, I'm now running into new errors in read.cross that I can't seem to sort out. I have attached a small subset of my dataset here (the genotype file; it is identical to the full one, just smaller) and my full phenotype file, and I will paste the error below. It seems to keep adding 10k extra random rows, but they are all empty. It is also having trouble matching the IDs between the phenotype and genotype files. I initially thought it was because I was specifying the header as false, but making it true causes it to treat the chromosome row as the SNP and the map locations as genotypes. Any idea what could be causing this, or do you know if it will read those rows? If it throws them out, since presumably they are empty, I guess that would be fi

--Read the following data:
         109950  individuals
         41563  markers
         8  phenotypes
 --Cross type: bc


Warning messages:
1: In read.cross.csvs(dir, genfile, phefile, na.strings, genotypes,  :
  10000 individuals with genotypes but no phenotypes
        0|    1|    2|    3|    4|    5|    6|    7|    8|    9|   10|   11|   12|   13|   14|   15|   16|   17|   18|   19|   20|   21|   22|   23|   24|   25|   26|   27|   28|   29|   30|   31|   32|   33|   34|   35|   36|   37|   38|   39|   40|   41|   42|   43|   44|   45|   46|   47|   48|   49|   50|   51|   52|   53|   54|   55|   56|   57|   58|   59|   60|   61|   62|   63|   64|   65|   66|   67|   68|   69|   70|   71|   72|   73|   74|   75|   76|   77|   78|   79|   80|   81|   82|   83|   84|   85|   86|   87|   88|   89|   90|   91|   92|   93|   94|   95|   96|   97|   98|   99|  100|  101|  102|  103|  104|  105|  106|  107|  108|  109|  110|  111|  112|  113|  114|  115|  116|  117|  118|  119|  120|  121|  122|  123|  124|  125|  126|  127|  128|  129|  130|  131|  132|  133|  134|  135|  136|  137|  138|  139|  140|  141|  142|  143|  144|  145|  146|  147|  148|  149|  150|  151|  152|  153|  154|  155|  156|  1 [... truncated]

2: In read.cross.csvs(dir, genfile, phefile, na.strings, genotypes,  :
  10000 individuals with phenotypes but no genotypes
    0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|101|102|103|104|105|106|107|108|109|110|111|112|113|114|115|116|117|118|119|120|121|122|123|124|125|126|127|128|129|130|131|132|133|134|135|136|137|138|139|140|141|142|143|144|145|146|147|148|149|150|151|152|153|154|155|156|157|158|159|160|161|162|163|164|165|166|167|168|169|170|171|172|173|174|175|176|177|178|179|180|181|182|183|184|185|186|187|188|189|190|191|192|193|194|195|196|197|198|199|200|201|202|203|204|205|206|207|208|209|210|211|212|213|214|215|216|217|218|219|220|221|222|223|224|225|226|227|228|229|230|231|232|233|234|235|236|237|238|239|240|241|242|243|244|245|246|247|248|249|250|251|252|253|254|255|256|257|258|259|260|261|262|263 [... truncated]
3: In summary.cross(cross) : The individual IDs are not unique.
[1] 109950

cross <- read.cross(format="csvs",
                    header=FALSE,
                    genfile=genotype,
                    phefile=phenotype,
                    genotypes=c("A","B"),
                    estimate.map=FALSE,
                    na.strings="NA")

genotype_head100.csv

Dominick Costanzo

Oct 24, 2025, 3:50:59 PM
to R/qtl discussion
combined_phenotypes.csv

Karl Broman

Oct 24, 2025, 5:32:55 PM
to R/qtl discussion
I'm able to read these data without any errors. I just get a warning about 99852 individuals with phenotypes but no genotypes.

x <- read.cross("csvs", "", "genotype_head100.csv", "combined_phenotypes.csv", genotypes=c("A", "B"))
  --Read the following data:
      99950  individuals
      16383  markers
      8  phenotypes
  --Cross type: f2
 Warning message:

 In read.cross.csvs(dir, genfile, phefile, na.strings, genotypes,  :
   99852 individuals with phenotypes but no genotypes
     98|99|100|101|102|103|104|105|106|107|108|109|110|111|112|113|114|115|116|117|118|119|120|121|122|123|124|125|126|127|128|129|130|131|132|133|134| 135|136|137|138|139|140|141|142|143|144|145|146|147|148|149|150|151|152|153|154|155|156|157|158|159|160|161|162|163|164|165|166|167|168|169|170|171|172|173|174|175|176|177|178|179|180|181|182|183|184|185|186|187|188|189|190|191|192|193|194|195|196|197|198|199|200|201|202|203|204|205|206|207|208|209|210|211|212|213|214|215|216|217|218|219|220|221|222|223|224|225|226|227|228|229|230|231|232|233|234|235|236|237|238|239|240|241|242|243|244|245|246|247|248|249|250|251|252|253|254|255|256|257|258|259|260|261|262|263|264|265|266|267|268|269|270|271|272|273|274|275|276|277|278|279|280|281|282|283|284|285|286|287|288|289|290|291|292|293|294|295|296|297|298|299|300|301|302|303|304|305|306|307|308|309|310|311|312|313|314|315|316|317|318|319|320|321|322|323|324|325|326|327|328|329|330|331|332|333|334 [... truncated]


If I reduce the phenotype file to just the first 99 lines, it reads in with no warnings.

karl
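
The same reduction can be done by ID rather than by line count. A minimal base-R sketch with tiny invented data frames (the column names follow the files shown earlier in the thread; the values are made up): it keeps only the phenotype rows whose ID appears in the genotype file and reports the IDs found on one side only.

```r
# Hypothetical stand-ins for the two files, with IDs kept as character
geno <- data.frame(id = c("0", "1", "2"),
                   snp1 = c("A", "B", "A"))
phe  <- data.frame(id = as.character(0:5),
                   pheno_data_30C = c(NA, -0.10, -0.08, -0.07, -0.06, -0.14))

# IDs present in one file but not the other
only_phe  <- setdiff(phe$id, geno$id)
only_geno <- setdiff(geno$id, phe$id)

# Phenotype rows restricted to genotyped individuals
phe_sub <- phe[phe$id %in% geno$id, , drop = FALSE]

print(only_phe)
print(only_geno)
print(phe_sub)
```

If `only_geno` is non-empty when the files are supposed to cover the same individuals, the IDs themselves probably differ in some way, such as whitespace or quoting.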


Dominick Costanzo

Oct 24, 2025, 5:35:58 PM
to rqtl...@googlegroups.com
Yes, the same is the case for me, but as soon as I use the full data set I get the warnings about 10k extra rows being added. Am I exceeding some sort of internal limit that is causing weird behavior? The full data file is exactly the same, just with 99,950 rows and 41k columns.



Karl Broman

Oct 24, 2025, 5:39:12 PM
to R/qtl discussion
In that previous warning you got

10000 individuals with genotypes but no phenotypes
        0|    1|    2|    3|    4|    5|    6|    7|    8|    9|   10|   11|   12|   13|   14|   15|   16|   17|   18|   19|   20|   21|   22|   23|   24|   25|   26|   27|   28|   29|   30|   31|   32|   33|   34|   35|   36|   37|   38|   39|   40|   41|   42|   43|   44|   45|   46|   47|   48|   49|   50|   51|   52|   53|   54|   55|   56|   57|   58|   59|   60|   61|   62|   63|   64|   65|   66|   67|   68|   69|   70|   71|   72|   73|   74|   75|   76|   77|   78|   79|   80|   81|   82|   83|   84|   85|   86|   87|   88|   89|   90|   91|   92|   93|   94|   95|   96|   97|   98|   99|  100|  101|  102|  103|  104|  105|  106|  107|  108|  109|  110|  111|  112|  113|  114|  115|  116|  117|  118|  119|  120|  121|  122|  123|  124|  125|  126|  127|  128|  129|  130|  131|  132|  133|  134|  135|  136|  137|  138|  139|  140|  141|  142|  143|  144|  145|  146|  147|  148|  149|  150|  151|  152|  153|  154|  155|  156|  1 [... truncated]

...it looks like your ID column has a bunch of spaces preceding the IDs, which are interpreted as being part of the IDs, so it's viewing them as different from the individuals in the phenotype file.

karl
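
One way such padding can arise, offered only as a possibility that the thread does not confirm, is fixed-width formatting of the ID column: every ID is padded to the width of the longest one. With IDs 0 to 99949 that width is five characters, so exactly the IDs with fewer than five digits (0 to 9999) pick up leading spaces, which would be consistent with the count of 10000 in the warning. The sketch below reproduces that count and shows that `trimws()` restores the match; the ID range is an assumption based on the 99,950 individuals mentioned earlier.

```r
# Fixed-width formatting pads shorter IDs with leading spaces
ids_padded <- format(0:99949)        # e.g. "    0", " 9999", "10000"
ids_clean  <- as.character(0:99949)  # e.g. "0", "9999", "10000"

# Padded IDs that no longer match any clean ID
n_mismatch <- length(setdiff(ids_padded, ids_clean))
print(n_mismatch)  # 10000

# Trimming whitespace restores the match
ids_fixed <- trimws(ids_padded)
print(identical(ids_fixed, ids_clean))
```

A file cut down to the first 100 rows would not show the problem if it was written out again after the cut, since the padding width depends on the longest ID in whatever is being written.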

Dominick Costanzo

Oct 24, 2025, 6:18:55 PM
to rqtl...@googlegroups.com
That's really strange, because the miniature version of the data I use is simply the first 100 rows stripped off, so I would think it should have the same spaces. I'm going to try to clean the spaces off and run it again.

Dominick Costanzo

Oct 27, 2025, 1:24:25 PM
to rqtl...@googlegroups.com
Thanks so much for your help, Karl. You were correct: somehow there were mystery spaces there. I've removed them, and that fixed the error.
I was hoping I could get your thoughts on the best way of narrowing down QTL locations. I'm using CIM and I've run a variety of window sizes from 5 to 100 cM. I'm currently running bootstraps just for comparison, and I am planning on calculating Bayesian confidence intervals. My LODs are quite high, so the LOD drop does not always give a realistic estimate wider than a single cM location. Besides calling the causative location, I'd also like a better way of comparing between phenotypes to see whether the locations are the same, beyond just overlaying the plots on one another. Any suggestions to this end?
Thanks again,
Dom
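
To illustrate why very high LOD scores collapse both kinds of interval to a single position, here is a simplified hand calculation on an invented LOD profile. It is a sketch on an evenly spaced grid, not R/qtl's implementation (the package provides `lodint()` and `bayesint()` for real scan results); the function names `lod_drop_interval` and `bayes_interval` and the curve shapes are hypothetical.

```r
# Invented LOD profiles on a 1 cM grid: same shape, different peak heights
pos      <- seq(0, 50, by = 1)
shape    <- exp(-((pos - 20) / 8)^2)
lod_high <- 120 * shape
lod_low  <- 6 * shape

# LOD support interval: positions within `drop` of the peak
lod_drop_interval <- function(pos, lod, drop = 1.5) {
  range(pos[lod >= max(lod) - drop])
}

# Approximate Bayes interval: treat 10^LOD as proportional to the posterior
# on the grid and keep the highest-weight positions until `prob` is reached
bayes_interval <- function(pos, lod, prob = 0.95) {
  w      <- 10^(lod - max(lod))  # rescaled to avoid overflow
  post   <- w / sum(w)
  ord    <- order(post, decreasing = TRUE)
  n_keep <- which(cumsum(post[ord]) >= prob)[1]
  range(pos[ord[seq_len(n_keep)]])
}

print(lod_drop_interval(pos, lod_high))
print(bayes_interval(pos, lod_high))
print(lod_drop_interval(pos, lod_low))
print(bayes_interval(pos, lod_low))
```

Because 10^LOD falls off so steeply around a tall peak, nearly all of the weight sits on one grid point. In that case the interval width reflects the grid spacing and marker density more than any real uncertainty about location.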