Function
The GFF file given by gffin=FILE will be parsed and the components optionally output to tables, a text file of
comment lines (starting #) and fasta format sequences if given. The GFF filename sets the output prefix,
which can be overridden with basefile=FILE.
The default fields parsed from the GFF are: locus, source, feature, start, end, score, strand, phase, attributes.
Additional fields can be extracted from the attributes field, using attributes=LIST. Setting attributes="*" or
attributes=all will extract all attributes into additional fields. Note that the attributes field itself will be
kept unless attfield=F is used to remove it.
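
The field and attribute handling described above can be sketched in Python as follows. This is a minimal illustration, not the module's own code: the function names parse_attributes and parse_feature are hypothetical, and it assumes attributes are written as semicolon-separated X=Y pairs in the ninth tab-delimited column.

```python
# Sketch only: parse_attributes and parse_feature are hypothetical helpers,
# not functions from the documented module.
GFF_FIELDS = ["locus", "source", "feature", "start", "end",
              "score", "strand", "phase", "attributes"]

def parse_attributes(attstr):
    """Split a GFF3 attributes string (X=Y;X=Y) into a dictionary."""
    atts = {}
    for pair in attstr.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing or doubled semicolons
        key, _, value = pair.partition("=")
        atts[key] = value
    return atts

def parse_feature(line, attributes="*", attfield=True):
    """Parse one feature line into the nine default fields plus chosen attributes."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF_FIELDS):
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(values))
    entry = dict(zip(GFF_FIELDS, values))
    entry["start"] = int(entry["start"])
    entry["end"] = int(entry["end"])
    parsed = parse_attributes(entry["attributes"])
    # "*" or "all" pulls every attribute out into its own field.
    wanted = list(parsed) if attributes in ("*", "all") else attributes
    for key in wanted:
        entry[key] = parsed.get(key, "")
    if not attfield:
        del entry["attributes"]  # drop the raw attributes column
    return entry

line = "chr1\tprokka\tCDS\t100\t400\t.\t+\t0\tID=cds1;product=hypothetical protein"
row = parse_feature(line, attributes=["product"], attfield=False)
all_row = parse_feature(line, attributes="*")
```

Here row carries only the extra product field and no raw attributes column, while all_row keeps the attributes column and gains both ID and product.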
integrity=T will perform checks that the features do not go outside the range of the parsed sequence-region and/or
fasta sequences.
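
As a rough illustration of this kind of range check, the Python sketch below reads GFF3 ##sequence-region comments and reports features that fall outside them. The helper names are hypothetical, and skipping loci with no known region is an assumption made here, not documented behaviour.

```python
# Sketch only: hypothetical helpers illustrating a sequence-region range check.
def parse_sequence_regions(lines):
    """Collect '##sequence-region seqid start end' comments into a dictionary."""
    regions = {}
    for line in lines:
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) >= 4:
                regions[parts[1]] = (int(parts[2]), int(parts[3]))
    return regions

def out_of_range(features, regions):
    """Return the (locus, start, end) features that extend beyond their region."""
    bad = []
    for locus, start, end in features:
        if locus not in regions:
            continue  # assumption: nothing to check without a known region
        region_start, region_end = regions[locus]
        if start < region_start or end > region_end:
            bad.append((locus, start, end))
    return bad

regions = parse_sequence_regions(["##gff-version 3",
                                  "##sequence-region chr1 1 1000"])
problems = out_of_range([("chr1", 10, 500),
                         ("chr1", 900, 1200),
                         ("chr2", 1, 5)], regions)
```

The same comparison could be made against fasta sequence lengths by building the regions dictionary as (1, length) for each sequence.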
indelwarn=T and stopwarn=T will identify adjacent CDS features that may have sequencing and/or translation
errors. indelwarn=T looks for adjacent CDS features with the same (or hyplist=LIST) "product" (warnfield=X)
annotation that are within 3 nt of each other (generally overlapping) and might thus represent a fragmented ORF due
to a frameshift error. stopwarn=T identifies similar features that have exactly one codon between them, which
could represent a codon from an atypical genetic code being mistranslated as a stop codon.
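
The two warnings can be pictured with the Python sketch below. It is a loose interpretation rather than the module's logic: classify_pair is a hypothetical function, both features are assumed to be on the same locus and strand and sorted by position, "within 3 nt" is taken to mean a gap of fewer than 3 nt (including any overlap), and a pair is treated as comparable if the products match or either product is in hyplist.

```python
# Sketch only: thresholds and matching rules here are assumptions, as stated above.
def classify_pair(prev, nxt, hyplist=("hypothetical protein",)):
    """Classify two adjacent CDS features as a possible 'indel', 'stop', or None."""
    same = prev["product"] == nxt["product"]
    hypothetical = prev["product"] in hyplist or nxt["product"] in hyplist
    if not (same or hypothetical):
        return None
    # Number of nucleotides lying strictly between the two features;
    # negative values mean the features overlap.
    gap = nxt["start"] - prev["end"] - 1
    if gap == 3:
        return "stop"   # exactly one codon separates the features
    if gap < 3:
        return "indel"  # overlapping or nearly touching: possible frameshift
    return None

first = {"start": 100, "end": 400, "product": "kinase"}
one_codon_on = {"start": 404, "end": 700, "product": "kinase"}
overlapping = {"start": 399, "end": 700, "product": "kinase"}
distant = {"start": 500, "end": 700, "product": "kinase"}
unrelated = {"start": 404, "end": 700, "product": "ligase"}

calls = [classify_pair(first, other)
         for other in (one_codon_on, overlapping, distant, unrelated)]
```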
joinseq=T will output joined sequences to *.joined.gff and, if sequences are parsed, *.joined.aa.fas and
*.joined.nt.fas. For protein sequence translations, stopwarn sequences are joined with a *. indelwarn
sequences are joined with flanking and internal xx pairs that delineate the overlapping parts of each
annotated protein sequence.
NOTE: Only GFF3 is currently supported.
Commandline
Input/Output Options
gffin=FILE : Input GFF file to parse [None]
seqin=FILE : Optional fasta file of reference sequences [None]
gfftab=T/F : Whether to output parsed GFF file as a delimited table with headers [True]
gffloci=T/F : Whether to parse sequence-region GFF comments to *.loci.tdt [True]
gffcomment=T/F : Whether to output parsed GFF comments to *.comments.txt [False]
gfffasta=T/F : Whether to output parsed GFF sequences to *.fasta [False]
attributes=LIST : List of attributes (X=Y;) to pull out into own fields ("*" or "all" for all) [*]
attfield=T/F : Whether to keep the full attribute field as parsed from the GFF file [False]
gffout=FILE : Save updated GFF format to FILE [None]
gffseq=T/F : Whether to include sequences in updated GFF file [False]
GFF Processing Options
integrity=T/F : Perform GFF integrity check based on parsed sequence-region comments and/or fasta [True]
indelwarn=T/F : Perform check for possible indels based on overlapping/close common features [True]
hypindel=INT : Number of hypothetical proteins that can be involved in a possible indel (0-2) [1]
stopwarn=T/F : Perform check for possible codon table stop codon errors based on close common features [True]
warnfield=X : Attribute field to use for generating indel or stop codon warnings [product]
idfield=X : Attribute field to use for CDS gene ID [ID]
hyplist=LIST : List of warnfield values to identify as hypothetical protein ['hypothetical protein']
cdsfeatures=LIST : List of feature types to count as CDS for warning checks [CDS]
joinseq=T/F : Whether to join sequences possibly affected by stop codons or frameshifts [False]
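
To make the option=value convention concrete, the Python sketch below shows one way such arguments could be read against a few of the defaults listed above. It is illustrative only: parse_options is a hypothetical function, the comma-separated form for LIST options is an assumption, and the input filename genome.gff is invented for the example.

```python
# Sketch only: a hypothetical reader for option=value arguments, using a
# subset of the defaults listed in this section.
DEFAULTS = {
    "gffin": None,
    "gfftab": True,
    "attfield": False,
    "hypindel": 1,
    "attributes": ["*"],
    "warnfield": "product",
}

def parse_options(argv, defaults):
    """Overlay option=value arguments on the defaults, typed by each default."""
    opts = dict(defaults)
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or key not in opts:
            continue  # ignore anything that is not a known option=value pair
        default = defaults[key]
        if isinstance(default, bool):      # T/F options (checked before int)
            opts[key] = value.upper() in ("T", "TRUE")
        elif isinstance(default, int):     # INT options
            opts[key] = int(value)
        elif isinstance(default, list):    # LIST options, assumed comma-separated
            opts[key] = value.split(",")
        else:                              # FILE and X options stay as strings
            opts[key] = value
    return opts

opts = parse_options(["gffin=genome.gff", "attfield=T", "hypindel=2",
                      "attributes=ID,product", "unknown=1"], DEFAULTS)
```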
History
Module Version History
# 0.0.0 - Initial Compilation.
# 0.1.0 - Basic functional version.
# 0.1.1 - Modified for splice isoform handling.
# 0.1.2 - Fixed parsing of GFFs with sequence-region information interspersed with features.
# 0.1.3 - Added option to parseGFF to switch off the attribute parsing.
# 0.2.0 - Added GFF output with ability to fix tab-delimiter errors in GFF files.
# 0.2.1 - Added restricted feature parsing from GFF.