File Format

Ronnie Nelson

Mar 21, 2013, 6:12:39 AM

to mapf...@googlegroups.com

Below is a description of the file format. Once the package is installed the files indicated are available in the package directory:

The example files use the triM format, together with one phenotype file, and your own files need to follow the same format. Note that all the files are tab-delimited.

The marker information file:

The marker information file starts with the number of chromosomes and the total number of markers. The following lines contain the number of markers on each chromosome, followed by the numeric IDs that indicate which column pair in the genotype file contains the genotypes for each marker. This information is given on one line per chromosome. The next lines contain the distances between the markers on each chromosome (i.e. the map data), also one line per chromosome. The last block of lines contains the names of the markers, again one line per chromosome. Note that the first column of each row that contains marker positions or marker names should be filled with a '1'.

The example marker information file: “mrkinfo_test.txt”

The genotype file:

The genotype file provides the genotypic information for each individual. The individual IDs are given in the first column of each row. The following columns are filled with integer values indicating the genotype at each locus in sequence. Each pair of columns corresponds to one marker (i.e. one allele in each column), and the column pairs are arranged in the order described in the marker information file.

The example genotype file: “marker_test.txt”
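As a rough illustration of the layout described above, the following R sketch writes a small tab-delimited genotype table with the ID in the first column and two allele columns per marker. The individuals, allele codes and object names are made up, and writing the file without a header row is an assumption; compare the result with “marker_test.txt” before relying on it.

```r
# Hypothetical genotypes: 3 individuals, 2 markers, two allele columns per marker.
geno <- data.frame(
  id   = c("101", "102", "103"),
  m1_a = c(1L, 1L, 2L), m1_b = c(2L, 1L, 2L),   # marker 1, alleles 1 and 2
  m2_a = c(1L, 2L, 2L), m2_b = c(1L, 2L, 1L),   # marker 2, alleles 1 and 2
  stringsAsFactors = FALSE
)

# Write tab-delimited text; no header row is written (an assumption, see above).
geno_file <- tempfile(fileext = ".txt")
write.table(geno, geno_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Each line holds the ID followed by 2 columns per marker.
geno_lines <- readLines(geno_file)
print(geno_lines)
```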

The pedigree file:

The pedigree file is arranged in full-sib families. For each family, the number of F2 individuals in the family is given first. This is followed by the individual IDs in the first column (starting with the F0 generation, then the F1s, and then the F2s). For each individual, the IDs of its parents are given in the next two columns, followed by its sex. Note that for the F0 generation the parents are indicated with a '0', and an additional column gives the line origin.

The example pedigree file: “ped_test.txt”

The phenotype file:

The phenotype file provides the phenotypes of all the individuals. The first line contains the heading “ID” in the first column, followed by the phenotype names. The rows below contain the individual IDs in the first column, followed by the phenotypic values as indicated by the headings.

The example phenotype file: “pheno_test.txt”
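A minimal R sketch of writing a phenotype file in this layout (an “ID” heading, then the phenotype names, tab-delimited) could look as follows. The phenotype names `weight` and `length` and all values are invented for illustration.

```r
# Hypothetical phenotypes for three individuals.
pheno <- data.frame(
  ID     = c(101, 102, 103),
  weight = c(12.3, 10.8, 11.5),
  length = c(4.1, 3.9, 4.4)
)

# Tab-delimited, with the header line "ID<TAB>weight<TAB>length".
pheno_file <- tempfile(fileext = ".txt")
write.table(pheno, pheno_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm the layout survives a round trip.
pheno_check <- read.delim(pheno_file, stringsAsFactors = FALSE)
print(pheno_check)
```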

Federico Calboli

Oct 8, 2013, 10:22:07 AM

to mapf...@googlegroups.com

Ronnie,

what happens if I have my data in R already? I ask this because I see a real risk of getting into 'bioinformatics hell', where most of the time is spent formatting and reformatting data to pass it along, rather than doing actual analyses.

For instance, I have an F1 family that looks like:

id   mother  father  sex  M1/1  M1/2  M2/1  M2/2  M3/1  M3/2
A00  0       0       1    256   268   162   166   256   272
B00  0       0       0    240   270   162   172   256   272
A01  B00     A00     0    268   270   162   162   272   272
A02  B00     A00     1    240   268   162   162   272   272
A03  B00     A00     0    240   268   166   172   272   272
A04  B00     A00     1    256   270   162   162   272   272
A05  B00     A00     1    240   256   162   172   NA    NA
A06  B00     A00     1    240   256   166   172   272   272
A07  B00     A00     1    240   268   166   172   272   272
A08  B00     A00     0    256   270   162   172   272   272

where the first column is the individual, the second and third the pedigree, and the fourth the sex, followed by the length of the microsatellite marker for every locus (two columns per locus, marked as locusname/1 and locusname/2). The distance between markers is available as a vector of cM distances

0 90.2 215.3

and the phenos are a normal data frame, one line per individual (parents included).
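For anyone who wants to reproduce this example, the table and the cM vector can be entered in R roughly like this. The object names `fed_lines`, `Fed_data` and `marker_cM` are only illustrative.

```r
# The example table, one string per row (whitespace-separated).
fed_lines <- c(
  "id mother father sex M1/1 M1/2 M2/1 M2/2 M3/1 M3/2",
  "A00 0 0 1 256 268 162 166 256 272",
  "B00 0 0 0 240 270 162 172 256 272",
  "A01 B00 A00 0 268 270 162 162 272 272",
  "A02 B00 A00 1 240 268 162 162 272 272",
  "A03 B00 A00 0 240 268 166 172 272 272",
  "A04 B00 A00 1 256 270 162 162 272 272",
  "A05 B00 A00 1 240 256 162 172 NA NA",
  "A06 B00 A00 1 240 256 166 172 272 272",
  "A07 B00 A00 1 240 268 166 172 272 272",
  "A08 B00 A00 0 256 270 162 172 272 272"
)

# check.names = FALSE keeps headers such as "M1/1" unchanged;
# the string "NA" is read as a missing value by default.
Fed_data <- read.table(text = fed_lines, header = TRUE,
                       stringsAsFactors = FALSE, check.names = FALSE)

# The cM vector from the post, in the same order as the marker column pairs.
marker_cM <- c(0, 90.2, 215.3)

str(Fed_data)
```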

I really do not see any advantage in going through the non-trivial step of exporting everything in some close-enough format, doing some more formatting by hand, and then reimporting the data in the very same R session. Is there a way of feeding the data *as it is* to MAPfastR? I would be happy to change the way the microsatellite alleles are coded, that would be a snap, but I can see that recoding the whole dataset will be a massive hassle.

Please note, I am well aware I am being a pain by asking the above, but I also realise that for MAPfastR to be successful it must be used, and it must be 'usable' to be used. I offer the question above as user feedback, not as a criticism -- I'd love to use MAPfastR and to see it become successful. I am also happy to provide more details and feedback if you need them.

Best wishes

Federico

Mats

Oct 15, 2013, 5:28:59 AM

to mapf...@googlegroups.com

Hi Federico!

The MAPfastR data object is a regular R list, so what you need to do is reshape and split your current data frame to fit the structure, and you should be good to go. I attach some code that does that for the example you provided. You should be able to use it on real data with only very minor modifications, I think. The comments describe where some things must be added.

/Mats

#Assume that the data is currently in the data frame "Fed_data"
#In case you want to start from a data file in the format of the post
#(sep = "" splits on any whitespace)
#Fed_data <- read.delim("/path/datafile.txt", sep = "", stringsAsFactors = FALSE)

#First we extract and reshape the genotypes
Fed_temp_geno <- t(Fed_data[,5:dim(Fed_data)[2]])
Fed_geno <- as.data.frame(matrix(ncol = dim(Fed_temp_geno)[2]*2, nrow = dim(Fed_temp_geno)[1]/2))
odd <- seq(from= 1, to = dim(Fed_temp_geno)[1], by = 2)
even <- odd + 1
geno_col <- 1
for(col in 1:dim(Fed_temp_geno)[2]){
    Fed_geno[, geno_col] <- Fed_temp_geno[odd, col]
    Fed_geno[, geno_col+1] <- Fed_temp_geno[even, col]
    geno_col <- geno_col + 2
}

#This data frame you need to construct from your marker information data
#(one row per marker; the values here are placeholders for 3 markers)
Fed_placeholder <- data.frame(chr = 1, sex_1_cM = c(1,2,3), sex_2_cM = c(1,2,3), ref_cM = c(1,2,3))

#Here we finalise the genotypes into MAPfastR form
names(Fed_geno) <- rep(Fed_data[,"id"], each = 2)
Fed_geno <- cbind(Fed_geno, Fed_placeholder)

#Now, the pheno data.frame
#First, we deduce the generation of each individual from the pedigree
Fed_generation <- array(dim=dim(Fed_data)[1])
gen <- 1
prev_gen <- array()
curr_gen <- array()
while (gen <= 3){
    hit <- 1
    for (ind in 1:dim(Fed_data)[1]){
        if (gen == 1){
            if (Fed_data[ind, 2] == 0 & Fed_data[ind, 3] == 0){
                Fed_generation[ind] <- gen
                curr_gen[hit] <- Fed_data[ind, "id"]
                hit <- hit + 1
            }
        }
        else{
            if (Fed_data[ind, 2] %in% prev_gen & Fed_data[ind, 3] %in% prev_gen){
                Fed_generation[ind] <- gen
                curr_gen[hit] <- Fed_data[ind, "id"]
                hit <- hit + 1
            }
        }
    }
    prev_gen <- curr_gen
    curr_gen <- array()
    gen <- gen + 1
}
#Here you need to add line origin (for the founders in particular)
#One entry per individual, left as NA until you fill it in
Fed_line <- array(dim = dim(Fed_data)[1])

#Then we extract, combine and reshuffle
Fed_pheno <- cbind(Fed_generation,Fed_data[,c(4,2,3)], Fed_line)

#We conform to MAPfastR conventions
names(Fed_pheno) <- c("generation", "sex", "parent_1", "parent_2", "line")
#Recode sex from 0/1 to 2/1: 0 -> 2/(0+1) = 2 and 1 -> 2/(1+1) = 1
Fed_pheno[,"sex"] <- Fed_pheno[,"sex"] + 1
Fed_pheno[,"sex"] <- 2/Fed_pheno[,"sex"]

#Finally, we make a MAPfastR object, and add some bookkeeping parameters
#note the "$heterogam" and make sure that lines up with your sex encoding
Fed_MAPfastR_data <- list(pheno = Fed_pheno, geno = Fed_geno)
Fed_MAPfastR_data$backcross <- 0
Fed_MAPfastR_data$backcross.line <- NA
Fed_MAPfastR_data$backcross.parent <- NA
Fed_MAPfastR_data$sex.restrict <- 0
Fed_MAPfastR_data$sex.chrom <- NA
Fed_MAPfastR_data$heterogam <- 1
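The placeholder data frame above could be filled from the cM vector in the original post along these lines. This is only a sketch under several assumptions: the values 0, 90.2 and 215.3 are treated as positions on a single chromosome, the same map is used for both sexes and for the reference map, and the name `Fed_markers` is made up. The column names are the ones used in `Fed_placeholder`.

```r
# The cM vector from the original post, assumed to hold marker positions.
marker_cM <- c(0, 90.2, 215.3)

# If the vector instead held the gaps between adjacent markers,
# positions could be derived with cumsum(marker_cM).

# One row per marker, with the same columns as Fed_placeholder.
Fed_markers <- data.frame(
  chr      = 1,           # assumption: all markers on one chromosome
  sex_1_cM = marker_cM,   # assumption: same map for both sexes
  sex_2_cM = marker_cM,
  ref_cM   = marker_cM
)

print(Fed_markers)
```

Using `Fed_markers` in place of `Fed_placeholder` in the `cbind` call would then attach these positions to the genotype rows, provided the number of rows matches the number of markers.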

Federico Calboli

Oct 21, 2013, 11:27:38 AM

to mapf...@googlegroups.com

Hi Mats,

thank you for the help -- I rejigged the code you sent to fit my whole data (genotypes + phenotypes), and thus far I could create MAPfastR objects without trouble. Looking forward to getting some results now!

BW

F

