Hi Dan,
Unsupported organisms like S. cerevisiae are a bit tricky in a couple of cases because we don't offer gene set databases or mapping files for them, so you need to supply your own. That said, for the technical questions you're asking here I can offer some answers.
With regard to the dataset size issue: GSEA is intended to be run on the whole gene expression universe of the organism. The issue isn't strictly "how many genes you have in your dataset"; it's "how many genes you are assessing in relation to the number of genes expressed in the organism". If you have 75% genome coverage, you're probably fine for comparing your dataset against gene sets that were generated for your organism based on its gene universe.
In order to run the standard GSEA method you need at least 7 samples per phenotype. If you meet this requirement, which it sounds like you do, then running the standard method of GSEA with the default phenotype permutation is preferred.
In phenotype permutation mode, GSEA performs its assessments based on the probability that those specific gene sets would be enriched in a case where there is no phenotype correlation across the samples. In comparison, GSEA Preranked (and the standard mode in gene_set permutation mode) uses a null hypothesis essentially based on the probability that a random set of the same size would be enriched in your sample phenotype comparison.
The null hypothesis for the phenotype permutation mode is generally more informative, being directly linked to the strength of the phenotype. However, it's not possible to do this with a reasonable number of permutations when there are fewer than 7 samples per phenotype; hence the use of the alternative null distribution in the gene_set mode. If you aren't limited by replicate number, the phenotype mode is superior. However, this method does have lower power, necessitating a looser FDR cutoff (GSEA makes calls based on a 0.25 FDR cutoff, as described in the documentation).
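To make the sample-size limit concrete, here is a minimal sketch that counts how many distinct ways the phenotype labels can be assigned for a two-class design. The function name is hypothetical, and the target of 1000 permutations is only an assumption for illustration; substitute whatever number you actually run.

```python
from math import comb


def distinct_phenotype_labelings(n_a: int, n_b: int) -> int:
    """Count the distinct ways to pick which n_a of the n_a + n_b samples are labeled class A."""
    return comb(n_a + n_b, n_a)


# Assumed target number of permutations (illustrative only).
TARGET_PERMUTATIONS = 1000

for n in (3, 5, 7):
    labelings = distinct_phenotype_labelings(n, n)
    enough = labelings >= TARGET_PERMUTATIONS
    print(f"{n} samples per phenotype: {labelings} distinct labelings, enough = {enough}")
```

With 3 samples per phenotype there are only 20 distinct labelings, while 7 per phenotype gives 3432, which is why small designs cannot support a meaningful phenotype permutation null.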
With respect to the standard (albeit in gene_set permutation) mode vs. GSEA Preranked: there is no consensus, and we don't generally offer specific advice on which to use when you cannot use phenotype permutation mode. GSEA Preranked uses the same gene set permutation mode, so the only decision to make is whether you prefer the internal GSEA ranking metric (signal to noise ratio) or your own calculated metric. GSEA run with gene set permutation mode and the non-default log2_ratio_of_classes method should be equivalent to running Preranked on computed log2FC rankings.
By default, however, GSEA uses the signal to noise ratio, which scales the magnitude of the expression difference by the standard deviation of the expression levels. This would generally be an improvement over using the log2FC (in Preranked or otherwise) in isolation, as the magnitude of the log2FC isn't inherently linked to the significance and can be highly affected by expression variability. The method of scaling the log2FC by the -log10(pValue) attempts to address this by rescaling the distribution using the significance of the change.
I think in cases where you do not have enough samples to run phenotype permutation, as in the case you describe, the significance-scaled log2FC distribution might be the best option. However, as these methods have not been rigorously benchmarked against each other, we can't offer a hard, official recommendation. In your case, assuming you have sufficient sample numbers in all cases, phenotype permutation mode is preferred over either standard-in-gene-set-perm mode or Preranked.
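As a rough illustration of how the ranking metrics discussed above differ, here is a small sketch using made-up expression values. The function names and the data are hypothetical. The signal to noise function is the simplified textbook form, (mean A - mean B) / (sd A + sd B), and GSEA's own implementation may apply additional adjustments to the standard deviations that are omitted here. The p-value is assumed to come from your own differential expression tool.

```python
import math
import statistics


def signal_to_noise(a, b):
    """Simplified signal to noise ratio: difference of class means over the sum of class standard deviations."""
    return (statistics.mean(a) - statistics.mean(b)) / (statistics.stdev(a) + statistics.stdev(b))


def log2_ratio_of_classes(a, b):
    """log2 of the ratio of class means (a log2 fold change)."""
    return math.log2(statistics.mean(a) / statistics.mean(b))


def significance_scaled_log2fc(log2fc, p_value):
    """log2 fold change scaled by -log10(p-value), for use as a Preranked metric."""
    return log2fc * -math.log10(p_value)


# Two hypothetical genes with the same 2-fold difference in class means
# but very different variability across replicates.
stable_gene = ([200, 205, 195, 200], [100, 102, 98, 100])
noisy_gene = ([350, 50, 300, 100], [180, 20, 150, 50])

for name, (class_a, class_b) in (("stable", stable_gene), ("noisy", noisy_gene)):
    print(name,
          "log2 ratio:", round(log2_ratio_of_classes(class_a, class_b), 3),
          "signal to noise:", round(signal_to_noise(class_a, class_b), 3))
```

Both genes get the same log2 ratio, but the noisy gene's signal to noise value is far smaller, which is the behavior described above. A significance-scaled metric achieves a similar effect through the p-value: a gene with a small p-value is pushed further toward the ends of the ranked list than a gene with the same fold change and a large p-value.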
Hopefully this helps. Let me know if you have any additional questions, or if there was anything I failed to address.
-Anthony
Anthony S. Castanza, PhD
Curator, Molecular Signatures Database
Mesirov Lab, Department of Medicine
University of California, San Diego