Does the input GFF3 file you linked to contain only one gene? If so, you should get only one gene in the output. The resulting GTF should contain only the genes (all of the evidence is ignored).
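One quick way to confirm this is to count the distinct gene IDs in the resulting GTF. Below is a minimal Python sketch. It assumes the GTF uses the standard `gene_id "..."` attribute in column 9. The function name `gene_ids_in_gtf` is hypothetical and not part of MAKER or EVAL.

```python
import sys


def gene_ids_in_gtf(lines):
    """Return the distinct gene_id values in GTF records, in first-seen order."""
    seen = {}  # dicts keep insertion order in Python 3.7+
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a tab-delimited feature line
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("gene_id "):
                seen.setdefault(attr.split(" ", 1)[1].strip('"'), None)
                break
    return list(seen)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        ids = gene_ids_in_gtf(handle)
    print(f"{len(ids)} gene(s): {', '.join(ids)}")
```

Run it on the converted file, e.g. `python count_genes.py some_file.gtf` (the script and file names are placeholders). For a one-gene input, it should report a single gene.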
To convert for EVAL, use these command lines. Note the flags: -g for gff3_merge limits the output to genes only, and the FASTA must be included in the file, so do not use the -n flag.
gff3_merge -d maker_datastore_index.log -g -o some_file.gff
add_utr_start_stop_gff some_file.gff > some_file2.gff
maker2eval some_file2.gff
Note that MAKER versions after 2.09 no longer include add_utr_start_stop_gff. The UTR is now always annotated explicitly, so you go straight from gff3_merge to maker2eval_gtf.
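The conversion needs the FASTA embedded in the merged file, so it can help to check the gff3_merge output before converting. Below is a minimal Python sketch. It assumes the embedded sequence follows a standard GFF3 `##FASTA` directive. `check_merged_gff3` is a hypothetical helper and not part of MAKER.

```python
import sys


def check_merged_gff3(lines):
    """Return (number of gene features, whether a ##FASTA section is present)."""
    genes = 0
    has_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            has_fasta = True
            break  # everything after ##FASTA is sequence, not features
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == "gene":
            genes += 1
    return genes, has_fasta


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        n_genes, has_fasta = check_merged_gff3(handle)
    print(f"gene features: {n_genes}")
    if not has_fasta:
        print("no ##FASTA section found: was gff3_merge run with -n?")
```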
However, given that explanation, I have to wonder whether EVAL is appropriate for you. EVAL requires a reference annotation set, which is assumed to be 100% correct, for comparison. You get a perfect score whenever your gene calls are exactly identical to the reference set (which has obvious biases of its own, but we won't get into that). Since you have no reference set, EVAL will give you nothing beyond statistics on the distributions of intron and exon sizes.
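If those size distributions are all you need, you can compute them directly from the GTF. Below is a minimal Python sketch. It assumes the GTF carries `exon` records with a standard `transcript_id "..."` attribute. If your file only has coding segments, pass `feature="CDS"` to measure those instead. The function name is hypothetical.

```python
import sys
from collections import defaultdict
from statistics import median


def exon_intron_lengths(lines, feature="exon"):
    """Return (exon_lengths, intron_lengths) in bp from GTF records of one feature type."""
    blocks = defaultdict(list)  # transcript_id -> [(start, end), ...]
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("transcript_id "):
                tid = attr.split(" ", 1)[1].strip('"')
                blocks[tid].append((int(fields[3]), int(fields[4])))
                break
    exon_lens, intron_lens = [], []
    for coords in blocks.values():
        coords.sort()
        # GTF coordinates are 1-based and inclusive on both ends
        exon_lens.extend(end - start + 1 for start, end in coords)
        # the intron lies strictly between one block's end and the next block's start
        intron_lens.extend(
            next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(coords, coords[1:])
        )
    return exon_lens, intron_lens


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        exons, introns = exon_intron_lengths(handle)
    for name, lens in (("exon", exons), ("intron", introns)):
        if lens:
            print(f"{name}s: n={len(lens)} min={min(lens)} "
                  f"median={median(lens)} max={max(lens)}")
```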
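Without a reference annotation, you have two alternative ways to assess quality. The first is AED (Annotation Edit Distance), which MAKER computes for each gene during the run. It is essentially a variation of EVAL-like statistics, run against evidence clusters rather than a reference annotation. The second is simply % domain content, i.e. the fraction of predicted proteins that contain a recognizable protein domain.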
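To make the AED idea concrete: AED is 1 minus the congruence C, where C = (SN + SP) / 2 is the average of the gene model's sensitivity and specificity measured against the evidence. AED is 0 for perfect agreement and 1 for no overlap. Below is a simplified base-level sketch in Python. It assumes a single set of evidence intervals on the same contig and strand. MAKER itself works from its evidence clusters, so this is not its implementation. The sketch also pulls per-transcript values from MAKER output, assuming they appear as `_AED=` attributes on mRNA lines, as in MAKER 2 GFF3. All function names are hypothetical.

```python
def covered_bases(intervals):
    """Set of 1-based positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(prediction, evidence):
    """Base-level AED = 1 - (SN + SP) / 2 for one gene model vs. its evidence."""
    pred = covered_bases(prediction)
    evid = covered_bases(evidence)
    if not pred or not evid:
        return 1.0  # nothing to compare: treat as no support
    overlap = len(pred & evid)
    sn = overlap / len(evid)  # fraction of the evidence covered by the prediction
    sp = overlap / len(pred)  # fraction of the prediction supported by evidence
    return 1 - (sn + sp) / 2


def aed_values(gff3_lines):
    """Collect _AED values from mRNA records in MAKER GFF3 output."""
    values = []
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # sequence section, no more features
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "mRNA":
            continue
        for attr in fields[8].split(";"):
            key, _, value = attr.partition("=")
            if key == "_AED":
                values.append(float(value))
    return values


def cumulative_fraction(values, threshold):
    """Fraction of transcripts with AED at or below the threshold."""
    return sum(v <= threshold for v in values) / len(values) if values else 0.0
```

For a model spanning bases 1 to 100 and evidence spanning 51 to 150, SN = SP = 0.5, so AED = 0.5. Plotting `cumulative_fraction` over thresholds from 0 to 1 gives an AED distribution for a whole annotation set. The set-of-bases approach keeps the arithmetic obvious but is not meant for genome-scale inputs.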
See these links for examples of the statistics -->
I have also attached a figure with an example of quality analysis that combines AED, domain content, and comparative orthologs.
--Carson