File Format

Ronnie Nelson

Mar 21, 2013, 6:12:39 AM
Below is a description of the file format. Once the package is installed the files indicated are available in the package directory:

The example files use the triM format, together with one phenotype file; your own data must follow the same format, described below. Note that all the files are tab-delimited.

The marker information file:

The marker information file provides the number of chromosomes and the total number of markers. The following lines contain the number of markers on each chromosome, followed by a numeric ID for each marker that indicates which column pair in the genotype file contains the genotypes for that marker. This information is given on one line for each chromosome. The next lines contain the spacing between the markers on each chromosome (i.e. the map data), also one line for each chromosome. The last block of lines contains the name of each marker, one line for each chromosome. Note that the first column of each row containing the positions or names of the markers should be filled with a '1'.

The example marker information file: “mrkinfo_test.txt”


The genotype file:

The genotype file provides the genotypic information for each individual. The individual ID is given in the first column of each row. The following columns are filled with integer values indicating the genotype at each locus in sequence. Each pair of columns corresponds to one marker (i.e. one allele in each column), and the column pairs are arranged sequentially in the order described in the marker information file.

The example genotype file: “marker_test.txt”
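
As a rough illustration of the layout described above, here is a minimal R sketch that writes a small genotype table as a tab-delimited file. The toy genotypes, the column names and the output file name are made up for the example, and leaving out a header line is an assumption (the description does not mention one), so compare the result against the example file shipped with the package before relying on it.

```r
# Toy genotypes: one row per individual, the ID first,
# then two allele columns per marker (values are illustrative only).
geno <- data.frame(
  id   = c("A00", "B00", "A01"),
  M1_1 = c(256L, 240L, 268L),
  M1_2 = c(268L, 270L, 270L),
  M2_1 = c(162L, 162L, 162L),
  M2_2 = c(166L, 172L, 162L)
)

# Hypothetical output path; any writable location will do.
geno_file <- file.path(tempdir(), "my_genotypes.txt")

# Tab-delimited, unquoted. Assumption: no header line is written.
write.table(geno, file = geno_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the file back to confirm the layout: 1 ID column + 2 columns per marker.
check <- read.delim(geno_file, header = FALSE, stringsAsFactors = FALSE)
n_markers <- (ncol(check) - 1) / 2
```

Here `n_markers` should match the total number of markers declared in the marker information file.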


The pedigree file:

The pedigree file is arranged in full-sib families. For each family, the number of F2 individuals within the family is provided. This is followed by the individual IDs in the first column (starting with the F0 generation, then the F1s and then the F2s). For each individual, the IDs of its parents are provided in the next two columns, followed by its sex. Note that for the F0 generation the parents are indicated with a '0', and an additional column with line origins is provided.

The example pedigree file: “ped_test.txt”


The phenotype file:

The phenotype file provides the phenotypes of all the individuals. The first line contains the heading “ID” in the first column, followed by the phenotype names. The remaining lines contain the individual ID in the first column, followed by the phenotypic values in the order given by the headings.

The example phenotype file: “pheno_test.txt”
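
A phenotype file with this layout can be produced from an R data frame along the following lines. This is only a sketch: the trait names, the values and the output file name are invented for illustration, and the only things taken from the description above are the “ID” heading in the first column and the tab delimiter.

```r
# Toy phenotypes: the first column must be headed "ID";
# the trait names and values here are illustrative only.
pheno <- data.frame(
  ID     = c("A00", "B00", "A01"),
  weight = c(12.1, 10.4, 11.7),
  length = c(30.2, 28.9, 29.5)
)

# Hypothetical output path; any writable location will do.
pheno_file <- file.path(tempdir(), "my_phenotypes.txt")

# Tab-delimited, unquoted, with the header line kept and no row names.
write.table(pheno, file = pheno_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Inspect the header line to confirm it starts with "ID".
header <- strsplit(readLines(pheno_file, n = 1), "\t")[[1]]
```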

Federico Calboli

Oct 8, 2013, 10:22:07 AM

what happens if I have my data in R already?  I ask this because I see a real risk of getting into 'bioinformatics hell', where most of the time is spent formatting and reformatting data to pass it along, rather than doing actual analyses.

For instance, I have an F1 family that looks like:

   id mother father sex M1/1 M1/2 M2/1 M2/2 M3/1 M3/2
  A00      0      0   1  256  268  162  166  256  272
  B00      0      0   0  240  270  162  172  256  272
  A01    B00    A00   0  268  270  162  162  272  272
  A02    B00    A00   1  240  268  162  162  272  272
  A03    B00    A00   0  240  268  166  172  272  272
  A04    B00    A00   1  256  270  162  162  272  272
  A05    B00    A00   1  240  256  162  172   NA   NA
  A06    B00    A00   1  240  256  166  172  272  272
  A07    B00    A00   1  240  268  166  172  272  272
  A08    B00    A00   0  256  270  162  172  272  272

where the first column is the individual, the second and third the pedigree and the fourth the sex, followed by the length of the microsatellite marker for every locus (two columns per locus, marked as locusname/1 and locusname/2).  The distance between markers is available as a vector of cM distances

0 90.2 215.3

and the phenos are a normal data frame, one line per individual (parents included).

I really do not see any advantage in going through the non-trivial step of exporting everything in some close-enough format, doing some more formatting by hand, and then reimporting the data in the very same R session.  Is there a way of feeding the data *as it is* to MAPfastR?  I would be happy to change the way the microsatellite alleles are coded, that would be a snap, but I can see that recoding the whole dataset will be a massive hassle.

Please note, I am well aware I am being a pain by asking the above, but I also realise that for MAPfastR to be successful it must be used, and it must be 'useable' to be used.  I offer the question above as user feedback, not as a criticism -- I'd love to use MAPfastR and to see it become successful.  I am also happy to provide more details and feedback if you need them.

Best wishes



Oct 15, 2013, 5:28:59 AM

Hi Federico!

The MAPfastR data object is a regular R list, so what you need to do is re-shape and split your current data frame to fit the structure, and you should be good to go. I attach some code that does that for the example you provided below. You should be able to use it on real data with only very minor modifications, I think. The comments describe where some things must be added.


#Assume that the data is currently in the data frame "Fed_data"
#In case you want to start from a datafile of the format in the post
#Fed_data <- read.delim("/path/datafile.txt", sep = "", stringsAsFactors = FALSE)

#First we extract and reshape the genotypes
Fed_temp_geno <- t(Fed_data[,5:dim(Fed_data)[2]])
Fed_geno <- data.frame(matrix(ncol = dim(Fed_temp_geno)[2]*2, nrow = dim(Fed_temp_geno)[1]/2))
odd <- seq(from= 1, to = dim(Fed_temp_geno)[1], by = 2)
even <- odd + 1
geno_col <- 1
for(col in 1:dim(Fed_temp_geno)[2]){
    Fed_geno[, geno_col] <- Fed_temp_geno[odd, col]
    Fed_geno[, geno_col+1] <- Fed_temp_geno[even, col]
    geno_col <- geno_col + 2
}

#This dataframe you need to construct from your marker information data
Fed_placeholder <- data.frame(chr = 1, sex_1_cM = c(1,2,3), sex_2_cM = c(1,2,3), ref_cM = c(1,2,3))

#Here we finalise the genotypes into MAPfastR form
names(Fed_geno) <- rep(Fed_data[,"id"], each = 2)
Fed_geno <- cbind(Fed_geno, Fed_placeholder)

#Now, the pheno data.frame
#First, we deduce the generation of each individual from the pedigree
Fed_generation <- array(dim=dim(Fed_data)[1])
gen <- 1
prev_gen <- array()
curr_gen <- array()
while (gen <= 3){
    hit <- 1
    for (ind in 1:dim(Fed_data)[1]){
        if (gen == 1){
            #Founders have both parents coded as 0
            if (Fed_data[ind, 2] == 0 & Fed_data[ind, 3] == 0){
                Fed_generation[ind] <- gen
                curr_gen[hit] <- Fed_data[ind, "id"]
                hit <- hit + 1
            }
        } else {
            #Later generations have both parents in the previous generation
            if (Fed_data[ind, 2] %in% prev_gen & Fed_data[ind, 3] %in% prev_gen){
                Fed_generation[ind] <- gen
                curr_gen[hit] <- Fed_data[ind, "id"]
                hit <- hit + 1
            }
        }
    }
    prev_gen <- curr_gen
    curr_gen <- array()
    gen <- gen + 1
}

#Here you need to add line origin (for the founders in particular)
#One entry per individual (10 in the example data)
Fed_line <- array(dim = dim(Fed_data)[1])

#Then we extract, combine and reshuffle
Fed_pheno <- cbind(Fed_generation,Fed_data[,c(4,2,3)], Fed_line)

#We conform to MAPfastR conventions
names(Fed_pheno) <- c("generation", "sex", "parent_1", "parent_2", "line")
#Recode sex from 0/1 to 2/1: 0 becomes 2/(0 + 1) = 2 and 1 becomes 2/(1 + 1) = 1
Fed_pheno[,"sex"] <- Fed_pheno[,"sex"] + 1
Fed_pheno[,"sex"] <- 2/Fed_pheno[,"sex"]

#Finally, we make a MAPfastR object, and add some bookkeeping parameters
#note the "$heterogam" and make sure that lines up with your sex encoding
Fed_MAPfastR_data <- list(pheno = Fed_pheno, geno = Fed_geno)
Fed_MAPfastR_data$backcross <- 0
Fed_MAPfastR_data$backcross.line <- NA
Fed_MAPfastR_data$backcross.parent <- NA
Fed_MAPfastR_data$sex.restrict <- 0
Fed_MAPfastR_data$sex.chrom <- NA
Fed_MAPfastR_data$heterogam <- 1
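
For larger data sets, the genotype reshaping step can also be written without an explicit loop. The sketch below is an alternative, not part of the code above: it uses a small made-up table in place of the genotype columns of Fed_data, and the names `geno_wide`, `alleles` and `geno_long` are illustrative. It produces the same layout as the loop (one row per marker, two adjacent columns per individual), but as a matrix rather than a data frame.

```r
# Toy stand-in for the genotype columns of Fed_data (columns 5 onwards):
# one row per individual, two allele columns per marker.
geno_wide <- data.frame(
  M1_1 = c(256, 240, 268), M1_2 = c(268, 270, 270),
  M2_1 = c(162, 162, 162), M2_2 = c(166, 172, 162)
)
ids <- c("A00", "B00", "A01")

n_ind     <- nrow(geno_wide)
n_markers <- ncol(geno_wide) / 2

# t() gives (2 * n_markers) rows by n_ind columns. Viewing that as an
# allele x marker x individual array lets aperm() do the reshuffling.
alleles <- array(t(as.matrix(geno_wide)), dim = c(2, n_markers, n_ind))

# Reorder to marker x allele x individual, then flatten the last two
# dimensions so that each individual occupies two adjacent columns.
geno_long <- matrix(aperm(alleles, c(2, 1, 3)), nrow = n_markers)
colnames(geno_long) <- rep(ids, each = 2)
```

Wrapping the result in `as.data.frame()` before binding on the marker information would bring it back in line with the data frame used above.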

Federico Calboli

Oct 21, 2013, 11:27:38 AM
Hi Mats,

thank you for the help -- I rejigged the code you sent to fit my whole data (genotypes + phenotypes), and thus far I could create objects of class MAPfastR without trouble.  Looking forward to getting some results now!


