Gene Finding Methods

Outline

Overview

This document describes how we created gene models for the third annotated gene set for Neurospora crassa. Automated gene calls were created in a four-step procedure:

  • Gene location and structures were predicted by combining automated gene predictions, BLAST hits, EST alignments and manual annotation. This process is described in section Gene Structure Prediction.
  • Gene names were assigned to predicted gene structures based on homology to genes from related species. This process is described in section Gene Naming.
  • Names, symbols, ontology terms and other tags were assigned from the ongoing community annotation project. Details of this project may be found here.
  • The newly created genes were compared against those in the previous version. This process is described in section Gene Correspondence.

  • back to top

    Gene Structure Prediction

    Gene structures were predicted using a combination of manual annotation, FGENESH, GENEID, and GENEWISE. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigó, is available under the GPL. GENEWISE is part of the WISE2 package developed by Ewan Birney and is available from the Sanger Center. We used GENEID 1.2 and GENEWISE 2.2. FGENESH is not versioned.

    Both FGENESH and GENEID are classified as ab initio predictors; they predict genes from genomic sequence using statistical models of gene structure. Their models must first be trained against a set of trusted genes to ensure accurate prediction. Softberry trained FGENESH using proprietary methods and generated a parameter file for us. (We used the identical FGENESH parameter file for the version 2 gene set.) GENEID was trained using a set of 515 manual annotations on a previous assembly. The parameter file is available here. (Note, we changed the intron probabilities slightly from the parameters given by this link).

    GENEWISE is a homology-based predictor of gene structure. As we ran it, it splices and aligns an input protein sequence to genomic sequence. When the resultant spliced alignment forms a valid gene model, we treat it as such. While GENEWISE does accept some parameters that vary from species to species (most notably for intron nucleotide statistics and splice site consensus sequences) these can be set to generic default values which are non–species-specific. In this default case, GENEWISE essentially produces the best local alignment of a protein assuming that introns tend to start at GT and end at AG. Since we are interested in predicting complete gene structures, we post-process incomplete GENEWISE protein alignments by moving the first and last exon upstream or downstream to the closest possible start and stop codons respectively. If a stop codon was encountered upstream of a gene before a start could be found, the gene call was not used.

    Briefly, the gene predictions were combined as follows. The method we used for this new gene set is completely different from that used for gene set version 2.

    1. Both FGENESH and GENEID were run on the entire genomic sequence to provide an initial set of predicted genes. This resulted in a set containing 10,266 FGENESH predictions and 12,549 GENEID predictions.
    2. Next, GENEWISE was run on each BLAST hit from NR (after removing hits from N. crassa) that aligned to the genome with ≥ 80% identity and ≥ 80% similarity. This created 6,867 GENEWISE predictions. Because BLAST hits often overlap, the GENEWISE predictions fell into just 513 loci.
    3. 1265 genes were manually annotated from EST alignments and BLAST hits.
    4. Next, short predictions were filtered out. Any gene prediction less than 30aa (90nt) long was dropped. In addition, any FGENESH or GENEWISE gene prediction less than 50aa (150nt) long was dropped, unless it was overlapped by BLAST evidence, a hmmer hit, or EST evidence. All GENEID predictions, regardless of length, were dropped unless they had overlapping evidence.
    5. Wherever remaining predictions had overlapping exons, on identical or opposing strands, we clustered them into separate groups for further filtering. A set of non-overlapping transcripts was chosen from each cluster by ordering the genes from "best" to "worst", picking the "best" gene, then going down the list and adding any lower-ranked transcripts that did not overlap ones already selected. Transcripts were ranked according to the following criteria:

      • Intron lengths in fungi are much shorter than those in other eukaryotes. Even after training, FGENESH and GENEID occasionally predict genes that have introns that are much too long for fungi. If a transcript has an intron longer than 1000bp, it will not be chosen if there is a competing gene model with more reasonable intron lengths.
      • Where EST evidence is available, ESTs are linked together to form partial splice models. Each transcript is compared to these EST transcripts, and its splice junctions are scored. Each splice junction may be a true positive, a false positive (the prediction has a splice site where the EST evidence does not), or a false negative (the EST evidence is spliced where the prediction is not). Transcripts with fewer false predictions are chosen above those with more. Those with more true positives are chosen over those with less. Splice scores are adjusted to account for alignment errors in ESTs as well as potential instances of alternate splicing.
      • A transcript was considered to have "good BLAST coverage" if it overlapped one or more BLAST hits, each with ≥ 50% average identity and ≥ 70% query coverage. BLAST hits from N. crassa were ignored.
      • Predictions with good BLAST coverage were chosen above predictions that did not have good BLAST coverage.
      • If two predictions both had good BLAST coverage, we computed the "average BLAST length" for each transcript; that is, the average length of the top 1–3 overlapping BLAST hits from different taxa. (When there were more than three, the average was computed from the three with the highest scores.) The prediction closer in length to its average BLAST length was chosen first.
      • Otherwise, FGENESH was preferred to GENEWISE, and both were preferred to GENEID.
    6. After sorting and selecting the highest-ranked predictions, the resultant gene set contained 9,826 genes.


    back to top

    Untranslated Region (UTR) Prediction

    1949 genes in this set have annotated UTR. Of these, 1242 were annotated by hand. The remaining 707 instances of UTR were predicted computationally, as follows. When an EST alignment uniquely aligns with > 95% identity and overlaps a gene prediction, and the region of overlap has 100% coding agreement (e.g., every nucleotide in the region of overlap is coding in both the prediction and the alignment, or is noncoding in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along compatible EST alignments from the end(s) of the prediction. A chain of one or more overlapping, compatible EST alignments that begin at the 3′ or 5′ end forms each UTR extension. Our UTR predictions are conservative. When an EST aligns to more than one location on the genome, or touches more than one gene prediction (on either strand) it is ignored.


    back to top

    Gene Naming

    Genes are assigned names very conservatively. As this is a purely automated gene prediction process, we seek to avoid transferring unverified functional names from genes in other species to our predicted genes. Hits to previous N. crassa gene annotations are not considered in the following assignments.

    We hope to improve the gene naming process in the future based on Gene Ontology categories.

    There are currently five types of gene name that fall into three categories:

    1. NAME, or
      hypothetical protein similar to NAME, or
      conserved hypothetical protein

      Assigned to gene predictions where there is strong homology to a known NR protein. The criteria for this category are:

      • At least one BLASTX hit to a known NR protein (complexity filtering off, -F F, expect ≤ 1e-5), with
      • ≥ 50% identity and ≥ 70% coverage of the protein to the underlying genomic sequence, and
      • at least 25% overlap between the protein aligment and the final prediction.

      The name will follow one of these three formats, depending on our confidence in the name itself:

      • conserved hypothetical protein if the homologous protein NAME contains a word indicating the name has not been verified: {fragment, homolog, hypothetical, like, predicted, probable, putative, related, similar, synthetic, unknown, unnamed}, or
      • NAME if the homologous protein is from the curated Swiss-Prot gene set, otherwise
      • hypothetical protein similar to NAME

      Where there is more than one suitable name for a BLAST hit, we prefer Swiss-Prot names to non-Swiss-Prot names. If there are multiple distinct BLAST hits we choose the one with the highest average identity × the amount of overlap to the target gene.

      In all cases we filter out the species name, GIs, parenthetical comments, extra whitespace, etc. from the NR name.

    2. Hypothetical protein

      Assigned to gene predictions that show significant BLASTP homology to a protein in NCBI's protein set NR. The criteria for this category are:

      • BLASTP hit to NR (complexity filtering off, -F F, expect ≤ 1e-5)

    3. Predicted protein

      Assigned to gene predictions that do not show significant BLASTP homology to any proteins in NCBI's non-redundant set of proteins (NR) at the time that the complete BLASTP analysis was performed on the gene set.


    back to top

    Name counts

    177 transcript(s) had different names than their parent.
    2452 transcript(s) had non-generic names.

    "conserved hypothetical protein"2189
    "hypothetical protein"393
    "predicted protein"4812
     hypothetical protein similar to...1354
     other non-empty name1098

    back to top

    Gene Locus Numbers

    Every annotated gene is assigned a Locus Number of the form NCU#####. Locus numbers are the only guaranteed way to uniquely identify a gene. When the same gene exists in different assemblies and annotations, it will have the same locus number. Loci are simply identifiers; they do not have any internal structure. In particular they are not ordered in any way. We believe it is unreliable to encode attributes of an object, such as position, in its stable identifier. Instead, position and other attributes of a gene can be retrieved once you know its locus.

    With each new gene set, we map all genes from the previous set and thus preserve loci whenever possible. Any loci that cannot be mapped 1:1 (due to deletions, splits or merges) will be retired. New genes that do not have a 1:1 mapping to a previous gene receive new locus identifiers. In addition to its locus, each gene has a version number (so loci are in fact displayed as NCU#####.#). When genes are mapped from one assembly to another or when we publish a new set of gene calls, we will increment this version. To ensure consistency, all loci in a particular gene set have the same version number, even if some have not changed from the previous set.

    Both annotations and assemblies are versioned. Loci versions refer to the version of the annotation, not the assembly release number. This is annotation version 3 for N. crassa. Like the previous annotation (version 2) it was generated from assembly release 7. Version 1 of the annotation was built on assembly release 3. (Assembly releases 4–6 were not annotated.)

    A locus number without an explicitly specified version corresponds to the most recent version of the annotation. For example, NCU03456.1 refers to locus NCU03456, annotation 1 (as annotated on assembly release 3). NCU03456.3 and NCU03456 both refer to the same locus as annotated in version 3 (on assembly release 7).


    back to top

    Correspondence of version 2 and version 3 gene calls

    To compare version 2 and version 3 gene calls, 10,620 gene calls from version 2 were mapped to the 9,826 version 3 annotations.

  • 9,565 genes correspond one-to-one between version 2 and version 3. Of these,
    • 6,218 were completely unchanged from version 2,
    • 1,399 had identical CDS, but the annotation was extended to include UTR,
    • 612 had the same start and stop, but at least one splice site changed, and
    • 1,336 had a different start and/or stop codon due to a substantial change in the underlying gene model.
  • 94 genes were added in loci where there was no previous annotation due to new BLAST and/or EST evidence.
  • 966 genes were deleted. Over half of these were short, coding for proteins less than 30aa long. The remainder were removed due to lack of evidence or poor gene models.
  • 10 genes in version 2 were merged into 5 larger genes in version 3.
  • 79 genes in version 2 were split into 162 smaller genes in version 3.

Annotations added by the Community Annotation Project were transferred automatically to gene predictions in version 3 that were unambiguously equivalent to the same prediction in version 2 (e.g. same locus with altered UTR, splicing, etc.). Version 3 predictions that are ambiguous (merges, splits, etc.) are being manually processed to ensure that the annotation is moved to the correct new gene.

Five genes which have been characterized for phenotype have been altered (deleted, merged, or split) in version 3. Transfer of the phenotype information will be reviewed manually to ensure that, where possible, the correct gene from version 3 will receive the appropriate phenotype characterizations.


back to top

Structure Prediction Validation

To evaluate the accuracy of our gene predictions, we created a set of reference gene models exclusively from EST data. We then compared the two gene sets using a variety of metrics. In the tables below, we refer to the final, published gene set as the query and the EST-based gene set as the reference.

The Feature comparisons and Splice analysis sections only report on the subset of query genes that overlap reference genes. Although a substantial number of predicted genes overlap EST alignments, the majority do not. Because we use EST data to improve our gene calls, we expect lower accuracy in regions that lack supporting EST evidence, on the order of 5–10%. Therefore, while they are a useful measure of gene prediction accuracy, the numbers reported in those two sections do not apply evenly to all predicted genes.


back to top

Overview of Query Genes

9826 genes: 9846 transcripts (8055 spliced, 1791 unspliced; 4869/4977 +/-)
27208 exons, 17362 introns

len%cov%gc%at
genic1738249944.3154.6745.33
intergenic2184333655.6945.0054.99
exonic1502179938.3055.7244.28
intronic23640176.0347.9852.01
coding1435592536.6056.1143.89
5′ UTR2128450.5449.9250.08
3′ UTR4560391.1646.0553.95
alt. spliced33170.0150.0849.92
genomic39225835100.0049.2950.71

min

median

mean

max
overall length (incl. UTR)901299152832463
coding length901214145932463
exons per transcript122.7624
exons per spliced transcript233.1624
bp per exon126355315030
bp per intron32841361388
5′ UTR bp1121161986
3′ UTR bp12462761423

back to top

Overview of Reference genes

3197 genes: 3353 transcripts (1319 spliced, 2034 unspliced; 1717/1636 +/-)
5580 exons, 2227 introns

len%cov%gc%at
genic23106835.8953.2546.75
intergenic3691515294.1149.0450.96
exonic20852415.3253.6846.32
intronic2352460.6049.4150.57
coding17843854.5554.8545.15
5′ UTR422990.1148.7151.29
3′ UTR2689530.6946.4453.56
alt. spliced98040.0251.3148.69
genomic39225835100.0049.2950.71

min

median

mean

max
overall length (incl. UTR)455606575100
coding length454905594488
exons per transcript111.6610
exons per spliced transcript222.6910
bp per exon43653955100
bp per intron21761141740
5′ UTR bp1107124598
3′ UTR bp1166187954

back to top

Feature Mapping

Searched for mappings from 9826 query features to 3197 reference features.

Reference genes farther than 200bp from any query gene: 230
Containing 230 transcripts: 205 unspliced, 25 with canonical splices.
These may indicate missed genes.

51 query genes (0.5%) hit 44 reference genes (1.4%) on the opposing strand. Hits to opposing strands are considered misses in the following comparisions.

Strand mismatches such as these may be orientation problems in the reference set, or incorrect predictions in the query set. Unspliced reference transcripts are difficult to orient correctly. 32 reference gene(s) did not contain a single canonical splice site. These may easily have been placed on the wrong strand; queries hitting these genes are likely to be oriented correctly. 12 reference gene(s) had at least one canonical splice site; everything else being equal, their orientation is more likely to be correct than the queries hitting them.

The table below shows how query genes (rows, roman type) map to reference genes (columns, italic type). We cluster genes into overlapping groups on the same strand and record eight types of relationship. To the left of each arrow is the number of query genes in each type of cluster; the number of reference genes is to the right.

nonereference
one
many
none-045000
queryone77630140214026221315
many2023111419
counts per cluster type
 
nonereference
one
many
none-014.100
queryone79.0014.343.96.341.1
many0.000.20.30.10.6
percentages per cluster type

one↔one clear, unambiguous mappings from one query gene to one reference gene. none↔one reference genes that do not map to any query gene (misses). none↔many reference genes that overlap other reference genes but do not map to query genes. None↔X mappings may represent underpredictions. one↔none query genes that do not map to any reference gene. many↔none query genes that overlap other query genes but do not map to reference genes. X↔none mappings may represent overpredictions or deficiencies in the reference gene set. one↔many single query genes overlapped by multiple reference genes (merges). These may occur if the reference set contains partial genes. many↔one single reference genes spanning multiple query genes (splits). These are often annotation errors. many↔many complex clusters where multiple query genes map to multiple reference genes.


back to top

Feature Comparisons

Predicted genes that hit reference genes on the wrong strand, or hit reference genes containing only non–canonically-spliced transcripts, are ignored in the following calculations. When scoring genes with multiple transcripts, we select the single best transcript↔transcript comparison. For clusters in the one↔many and many↔many categories above, each query is compared to as many references as possible. The best non-overlapping comparisons are used to score these loci.

#compTPTNFPFN
nucleotides20158631766286215385240531787
splice sites62395595-401243

sn

sp

smc

acp

ac

cc
nucleotides0.98230.99860.98300.96030.92070.9193
splice sites0.95840.93310.8968---

#compTPTNFPFN
introns21531870-11111
exons39783417-324

sn

sp

sn_sp

wrong

miss

introns0.95020.86860.90940.00560.0516
exons0.82900.85900.84400.00580.0008

#comp the number of subfeatures compared. TP true positives: nucleotides predicted as exonic in both query and reference; or, splice junctions with exact agreement, in position and type (donor:donor, acceptor:acceptor), between query and reference; or, introns or exons with both splice sites in exact agreement with the reference. TN true negatives: nucleotides predicted as intronic in both query and reference; not defined for splice sites, introns or exons. FP false positives (overpredictions): nucleotides marked as exonic in the query but intronic in the reference; or, splice junctions identified only in the query*; or query introns/exons that do not overlap a reference intron/exon. FN false negatives (underpredictions): nucleotides marked as intronic in the query but exonic in the reference; or, splice junctions identified only in the reference*; or reference introns/exons that do not overlap a query intron/exon. sn sensitivity, TP ÷ (TP+FN). sp specificity, TP ÷ (TP+FP). smc simple matching coefficient, (TP+TN) ÷ (TP+FN+FP+TN). acp average conditional probability. ac approximate correlation. cc correlation coefficient. These last three terms are described in detail in the citation below, and are not well defined for splices, introns or exons. *In the very rare cases where the query predicts a donor site exactly where the reference has an acceptor, or vice versa, we count it as both FP and FN. Terminology is from Burset M and Guigo R, "Evaluation of gene structure prediction programs." Genomics, 1996 Jun 15;34(3):353-67.

The table below briefly summarizes the relative coverage of overlapping query and reference genes. The first column shows the amount of sequence (in bp and percent) in each gene set that is overlapped by sequence in the other gene set. The next three columns show the overhanging sequence from transcripts that overlap but are offset. Overhangs are categorized as 5′ (one transcript begins upstream of the other), 3′ (one transcript extends downstream past the other), and mixed (indeterminate). Finally, the last column, "complete miss", counts the number of bp in queries that did not hit a reference at all, and vice versa.

A high degree of overlap can indicate a good match between the query and reference sets. A high "missed" score may simply indicate a small reference set.

overlapoverhang, 5′overhang, 3′overhang, mixedcomplete miss
query2015863 bp/11.6%784908 bp/4.5%528976 bp/3.0%763908 bp/4.4%13292184 bp/76.5%
reference2015863 bp/87.0%32040 bp/1.4%29623 bp/1.3%19653 bp/0.8%219600 bp/9.5%

back to top

Splice Analysis

5595 splice agreements, 644 disagreements.
3621 ignored: 42 due to EST misalignment, 3579 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 1714/1870
0 query transcripts contained noncanonical splices.

transcripts with no splice problems181388.0%
   ... with complete reference coverage25512.4%
explainable by alternate splicing1055.1%
clashes1426.9%

Transcripts that have a splice site dis­agree­ment with an over­lap­ping ref­er­ence gene are placed into two cate­gor­ies, de­pend­ing on the se­ver­ity of the clash. If all splice dis­agree­ments could be ex­plain­ed by well-known types of al­ter­nate splic­ing, we call the trans­cript a "pos­sible al­ter­nate splice." If the two trans­cripts cannot be rec­on­ciled in this way, we label the query a "clash." In par­ti­tio­ning splice dis­agree­ments into two cat­e­gor­ies, we are not as­sert­ing that 5.1% of this ge­nome shows al­ter­nate splic­ing. We do this as a form of triage: genes in the "clash" cat­e­go­ry are man­ually in­spect­ed before release.

in ref.in query
cassette exons60
retained introns980
early 3′ splices617
late 5′ splices58

cassette exon an exon that falls completely with an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction. retained intron an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns. early 3′ splices two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon. late 5′ splices two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron. Most ter­mi­nol­ogy is from Mat­lin AJ, et. al. Under­stand­ing al­ter­na­tive splic­ing: to­wards a cel­lular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.


back to top

Possible Problems

{multiple sources}1947
NC7_GENEWISE_3238
NC7_FGENESH_15346
NC7_GENEID_21243
NC7_MANUAL_11272
short proteins < 50aa19100180← not tallied in problems
shorter proteins < 30aa0-----
very short proteins < 10aa0-----
initial exon ≤ 6bp145190106200
internal exon ≤ 6bp47004601
terminal exon ≤ 6bp572034210
≥ 15 exons500500
intron ≥ 1000bp800710
intron ≤ 20bp0-----
first codon not Met271111221← not tallied in problems
first codon not xTG500221
first codon not known START500221
last codon not STOP410120
contains in-frame STOP100001
coding length not modulo 3100001
non-canonical splicing200002
has ≥1 good BLAST hit46021637389701105852← not tallied in problems
≤1/3 as long as BLAST hit5110010256
≥3× longer than BLAST hit6922212285
contains ≥1 N in exon420101
contains low-scoring sequence0-----
touches gap(s)0-----
spans contigs0-----
within 1kb of contig edge1020431← not tallied in problems
any overlap (UTR or CDS)323016936← in 32 clusters
CDS overlap only0-----
CDS overlap > 50bp0-----
CDS overlap > 100bp0-----
CDS overlap > 200bp0-----
has predicted UTR194927422621691242← not tallied in problems
UTR ≥ 50% length2161201323168← not tallied in problems
UTR is spliced157180912118← not tallied in problems
one or more problems44258222610551

back to top