Magnaporthe grisea Automated Gene Calling

Overview

This document describes the methodology used to produce the automated gene calls for the genome of Magnaporthe grisea. Automated gene calls were produced in what is essentially a two-step procedure:
  • Gene location and structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. This process is described in section Gene Structure Prediction.
  • Gene "names" were assigned to predicted gene structures based on homology to previously annotated genes. This process is described in section Gene Naming.

Gene Structure Prediction

Gene structures were predicted using a combination of FGENESH, FGENESH+, and GENEWISE. Both FGENESH and FGENESH+ are gene prediction programs acquired from Softberry.com. GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center.

Both FGENESH and FGENESH+ utilize a statistical model of gene structure that requires training on each organism for accurate prediction. FGENESH+ additionally combines a protein sequence with the statistical model to improve accuracy. We acquired these programs already trained by Softberry on Magnaporthe sequences. FGENESH was used by MIPS for their automated annotation of LGII and LGV.

GENEWISE (as we ran it) splices and aligns a protein sequence with genomic sequence to predict a gene structure. Although GENEWISE does utilize some species-specific parameters, most notably for intron nucleotide statistics and splice site consensus sequences, these can be set to non-species-specific defaults. In this case, GENEWISE essentially produces the best local alignment of a protein, assuming that introns start at GT and end at AG most of the time, and in some cases this results in a full alignment of the protein to the genome. Since we are interested in predicting complete gene structures, we post-processed incomplete GENEWISE protein alignments by moving the first and last exons upstream or downstream to the nearest start and stop codons, respectively. If a stop codon was encountered upstream of a gene before a start codon could be found, the gene call was not used.
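
As an illustration only, the post-processing of incomplete alignments described above might be sketched in Python as follows. This is not the original pipeline code. The function names, the 0-based forward-strand coordinates, and the handling of a missing downstream stop codon (the call is dropped) are assumptions made for the sketch, and for simplicity it ignores introns and treats the search as a walk along a contiguous reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def extend_to_start(seq, first_codon_pos):
    """Walk upstream one codon at a time from the first aligned codon.

    Returns the position of the nearest in-frame ATG, or None if a stop
    codon (or the beginning of the sequence) is reached first.
    """
    pos = first_codon_pos
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon == START_CODON:
            return pos
        if codon in STOP_CODONS:
            return None  # stop found before a start: the call is not used
        pos -= 3
    return None


def extend_to_stop(seq, last_codon_pos):
    """Walk downstream one codon at a time from the last aligned codon.

    Returns the exclusive end coordinate of the nearest in-frame stop
    codon, or None if the sequence ends first (assumed behaviour).
    """
    pos = last_codon_pos
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos + 3
        pos += 3
    return None


def complete_gene(seq, first_codon_pos, last_codon_pos):
    """Return (start, end) of the completed coding region, or None."""
    start = extend_to_start(seq, first_codon_pos)
    if start is None:
        return None
    end = extend_to_stop(seq, last_codon_pos)
    if end is None:
        return None
    return (start, end)


# Alignment covers only the codons CCC (at 8) and GGG (at 11).
print(complete_gene("CCATGAAACCCGGGTAACC", 8, 11))  # (2, 17)
```

In the toy sequence the alignment is extended upstream to the ATG at position 2 and downstream through the TAA ending at position 17.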

An assessment of the accuracy of GENEWISE, FGENESH, and FGENESH+ is described below in section Structure Prediction Validation.

Briefly, these three gene callers were combined in the following manner:

  1. FGENESH was run on the entire genomic sequence to provide an initial set of predicted genes. Each FGENESH prediction was put into a set of EVIDENCE_GENES.
  2. The genome was also searched against the non-redundant protein database using BLASTX.
  3. Regions of the genome with BLASTX homology spanning over 80% of a protein (when sub-alignments are stitched together in a consistent fashion) were considered "Homologous Gene Regions" (HGRs).
  4. HGRs were clustered into groups of HGRs that all implicated the same gene structure (most often representing groups of essentially orthologous proteins).
  5. For each cluster of HGRs, the protein showing the most sequence similarity to the genome was passed to both FGENESH+ and GENEWISE to produce two gene predictions, provided the protein had >80% amino acid identity to the translated genome (cumulative across sub-alignments).
  6. If the protein used in the previous step had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call, if valid, was favored over the FGENESH+ call. If the protein had >80% but less than 90% identity, then the FGENESH+ call, if valid, was favored over the GENEWISE call. In either case the favored call was used as the EVIDENCE_GENE for the HGR and added to the set of EVIDENCE_GENES (see Structure Prediction Validation for the reason why).
  7. When EVIDENCE_GENES overlapped in their exons, the EVIDENCE_GENE with the least amount of homology support (as measured by the sequence similarity of the protein used to make the call or zero for FGENESH calls) was removed from the set of EVIDENCE_GENES.
  8. All remaining EVIDENCE_GENES were then called as our official ANNOTATED_GENES and passed to the next step of gene calling for Gene Naming.
(Since EVIDENCE_GENES represent potential alternate gene predictions that may be based on homology, these genes are available to the user on the website.)
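
The choice between callers in steps 5 and 6 and the overlap rule in step 7 can be sketched roughly as below. This is a minimal Python illustration, not the production code. The `GeneCall` class, the function names, and the half-open exon intervals are hypothetical. Three behaviours are assumptions: falling back to the other caller when the favored call is invalid, treating exactly 90% identity as part of the lower band (the text leaves this case undefined), and resolving overlaps greedily from the best-supported call downward.

```python
from dataclasses import dataclass


@dataclass
class GeneCall:
    caller: str            # "FGENESH", "FGENESH+", or "GENEWISE"
    exons: list            # list of (start, end) half-open intervals
    identity: float = 0.0  # % identity of supporting protein; 0 for FGENESH


def pick_homology_call(identity, genewise_call, fgenesh_plus_call):
    """Choose the EVIDENCE_GENE for one HGR cluster (steps 5 and 6).

    A call that was not valid is passed in as None.
    """
    if identity <= 80.0:
        return None  # protein too dissimilar: no homology-based call
    if identity > 90.0:
        preferred, fallback = genewise_call, fgenesh_plus_call
    else:
        preferred, fallback = fgenesh_plus_call, genewise_call
    # Assumption: use the other caller when the favored call is invalid.
    return preferred if preferred is not None else fallback


def exons_overlap(a, b):
    """True if any exon of call a overlaps any exon of call b."""
    return any(s1 < e2 and s2 < e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def resolve_overlaps(calls):
    """Keep the best-supported call among exon-overlapping calls (step 7)."""
    kept = []
    for call in sorted(calls, key=lambda c: c.identity, reverse=True):
        if not any(exons_overlap(call, k) for k in kept):
            kept.append(call)
    return kept
```

Because plain FGENESH calls carry a support value of zero, any homology-based call that overlaps one of them displaces it in this sketch, which matches the ranking described in step 7.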

Gene Naming

Genes are assigned names VERY CONSERVATIVELY. Because this is a purely automated gene prediction process, we do not want to propagate misinformation by transferring unverified functional names for genes in one species to predicted genes in another species.

We hope to improve the gene naming process in the future based on Gene Ontology categories.

There are currently five types of gene name, grouped into three categories:

  1. NAME, or
    hypothetical protein similar to NAME, or
    conserved hypothetical protein

    Assigned to gene predictions where there is excellent homology to a known NR protein. The criteria for this category are:

    • Top BlastP hit to a known NR protein (complexity filtering off -F F, expect <= 1e-5), with
    • >=80% identity and >= 80% coverage of both the query and subject sequence.

    The exact name is assigned:
    • NAME if the homologous protein is from the curated SwissProt gene set (i.e. we trust the gene name), otherwise:
    • conserved hypothetical protein if the homologous protein NAME contains a word in the set {hypothetical, homolog, probable, putative, similar to, predicted, unnamed, unknown} (i.e. we do not want to transfer suspect names), otherwise:
    • hypothetical protein similar to NAME
    In all cases we take the NR protein name and try to filter out the species name, GIs, and extra whitespace.

  2. Hypothetical protein
    Assigned to gene predictions that show significant BlastP homology to a protein in NCBI's protein set NR or an EST alignment. The criteria for this category are:
    • BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), or
    • EST hit (>=300 nt, >=98% identity, >95% coverage) which overlaps the gene

  3. Predicted protein
    Assigned to gene predictions that have no EST alignment and show no significant BlastP homology to any protein in NCBI's non-redundant set of proteins (NR) at the time that the complete BlastP analysis was performed on the gene set. The criteria for this category are:
    • No BlastP hit to NR (complexity filtering off -F F, expect <= 1e-5), and
    • No EST hit (>=300 nt, >=98% identity, >95% coverage) which overlaps the gene
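
The three naming categories can be summarized as a small decision function. The Python sketch below is illustrative only. The `BlastHit` class and its field names are hypothetical, the suspect-word test is a simple case-insensitive substring match (an assumption), and the clean-up of species names, GIs, and whitespace is left out.

```python
from dataclasses import dataclass
from typing import Optional

SUSPECT_WORDS = ("hypothetical", "homolog", "probable", "putative",
                 "similar to", "predicted", "unnamed", "unknown")


@dataclass
class BlastHit:
    name: str                # NR protein name, already cleaned up
    identity: float          # percent identity
    query_coverage: float    # percent of the query covered
    subject_coverage: float  # percent of the subject covered
    evalue: float
    from_swissprot: bool = False


def assign_name(top_hit: Optional[BlastHit], has_est_hit: bool) -> str:
    """Return a gene name following the three categories described above."""
    significant = top_hit is not None and top_hit.evalue <= 1e-5

    # Category 1: excellent homology to a known NR protein.
    if (significant and top_hit.identity >= 80
            and top_hit.query_coverage >= 80
            and top_hit.subject_coverage >= 80):
        if top_hit.from_swissprot:
            return top_hit.name
        lowered = top_hit.name.lower()
        if any(word in lowered for word in SUSPECT_WORDS):
            return "conserved hypothetical protein"
        return f"hypothetical protein similar to {top_hit.name}"

    # Category 2: some BlastP or EST support.
    if significant or has_est_hit:
        return "hypothetical protein"

    # Category 3: no supporting evidence.
    return "predicted protein"
```

Here `has_est_hit` is assumed to be true only for EST alignments that already meet the length, identity, and coverage thresholds and overlap the gene.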

Gene Locus Numbers

Every annotated gene is given a Locus Number of the form MG##### that should be considered the only guaranteed way to identify a gene uniquely and positively. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure. We feel that encoding attributes of an object in the identifier for an object is a bad idea, and so please do not use the locus numbers to determine relative order of genes in the genome. Position is an attribute of a gene that can be retrieved by the locus.

With each new assembly, we do our best to map all genes from the previous assembly and thus preserve loci. Any loci that cannot be mapped will be retired. New genes will receive new loci.

All loci are versioned, and the version is appended to the locus name after a period, e.g. MG#####.1. The version of the locus will be incremented when a gene-structure or name change is made to a group of loci, and published as a new incremental release. All loci in a particular release will be assigned the same version number.
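
For scripts that need to separate a locus from its version, a minimal Python sketch follows. It assumes that MG##### means the letters MG followed by exactly five digits, and the identifier used in the example is made up. Only the version suffix is parsed. The digits of the locus itself are treated as opaque.

```python
import re

# Assumed pattern: "MG", five digits, a period, then an integer version.
LOCUS_RE = re.compile(r"^(MG\d{5})\.(\d+)$")


def split_locus(versioned):
    """Split a versioned locus name into (locus, version)."""
    match = LOCUS_RE.match(versioned)
    if match is None:
        raise ValueError(f"not a versioned locus name: {versioned!r}")
    return match.group(1), int(match.group(2))


print(split_locus("MG00001.3"))  # ('MG00001', 3)
```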

As of July 2003, we have modified the gene set (and thus updated versions) in incremental releases:

  1. Release 2: 11108 initial loci, all with version 1.

  2. Release 2.1: We improved our gene calling software, which resulted in 17 genes being modified, 2 existing genes being deleted, and 3 genes being added to our putative gene set. The release 2.1 gene set contains 11109 genes. See Release 2.1 Details for more information.

    We incremented the version number to 2 only for the affected genes.

    After making this change, we received feedback that it was confusing for some genes in the release to have version 1 and other genes to have version 2. We have since changed our policy, and now change the version number of all genes when we create a new release.

  3. Release 2.2: To be less conservative, we changed the names of a number of genes to match the current guidelines in section Gene Naming:
    • 3268 "predicted protein"s were renamed "hypothetical protein" based on BlastP homology or EST evidence.
    • 337 proteins with good homology to known Magnaporthe proteins were renamed.

    All 11109 Magnaporthe genes in release 2.2 have version 3. See Release 2.2 Details for more information.

Structure Prediction Validation

For a description of the structure prediction validation, please refer to the Neurospora crassa Automated Gene Calling page.

Minimum ORF Length

The gene prediction programs produced a large number of very short open reading frames.

A final filter was applied to the gene set to discard open reading frames shorter than 100 amino acids if there was no other evidence supporting the gene prediction. Any short genes that showed BLAST homology or BLAT alignment to ESTs were retained.

This filter caused 1486 short ORFs (of length less than 100 amino acids) to be discarded from the predicted gene set. These short ORFs are available on the download page, but are not currently viewable in the feature search.
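
The minimum-length filter amounts to a simple predicate. The Python sketch below is illustrative only, and the function and parameter names are hypothetical.

```python
def keep_prediction(protein_length, has_blast_homology, has_est_alignment,
                    min_length=100):
    """Decide whether a predicted gene survives the short-ORF filter.

    Predictions of at least min_length amino acids are always kept.
    Shorter ones are kept only if BLAST homology or a BLAT alignment
    to an EST supports them.
    """
    if protein_length >= min_length:
        return True
    return has_blast_homology or has_est_alignment
```

For example, `keep_prediction(60, False, True)` keeps a 60 amino acid prediction because an EST alignment supports it, while `keep_prediction(60, False, False)` discards it.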