Module:	rje_gff
Description:	GFF File Parser and Manipulator
Version:	0.2.1
Last Edit:	20/11/20
Webserver:	http://www.slimsuite.unsw.edu.au/servers/gff.php

Imported modules: rje rje_obj rje_db rje_seqlist rje_sequence

See SLiMSuite Blog for further documentation. See rje for general commands.

Function

The GFF file given by gffin=FILE is parsed and its components are optionally output as tables, a text file of comment lines (those starting with #) and, if the GFF contains sequences, fasta-format sequences. The GFF filename sets the output prefix, which can be overridden with basefile=FILE.
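The output naming described above can be sketched as follows. This is an illustration only, not the module's own code: the helper names output_prefix and output_files are hypothetical, and it is an assumption that the prefix is the GFF filename with its final extension removed. Only the suffixes named in this documentation are used.

```python
import os

def output_prefix(gffin, basefile=None):
    """Return basefile if given, else the GFF filename minus its extension.

    Assumption: only the final extension is stripped and any directory
    part of the path is kept. The real module may behave differently.
    """
    if basefile:
        return basefile
    return os.path.splitext(gffin)[0]

def output_files(prefix):
    """Map output types to filenames, using suffixes named in the docs."""
    return {
        "loci": prefix + ".loci.tdt",          # gffloci=T
        "comments": prefix + ".comments.txt",  # gffcomment=T
        "fasta": prefix + ".fasta",            # gfffasta=T
    }
```

For example, output_files(output_prefix("genome.gff")) would name the loci table genome.loci.tdt, while passing basefile="run1" would give run1.loci.tdt.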

The default fields parsed from the GFF are: locus, source, feature, start, end, score, strand, phase, attributes. Additional fields can be extracted from the attributes field, using attributes=LIST. Setting attributes="*" or attributes=all will extract all attributes into additional fields. Note that the attributes field itself will be kept unless attfield=F is used to remove it.
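As a rough illustration of the field and attribute extraction described above, the sketch below parses a single GFF3 feature line. It is not the rje_gff implementation: the function name parse_gff_line and its signature are hypothetical, and skipping attribute keys that clash with the nine default field names is an assumption made to keep the example safe.

```python
GFF_FIELDS = ["locus", "source", "feature", "start", "end",
              "score", "strand", "phase", "attributes"]

def parse_gff_line(line, attributes=("*",), attfield=True):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and comment lines (starting with #).
    attributes: attribute keys to pull into their own fields;
                "*" or "all" extracts every key.
    attfield:   whether to keep the raw attributes column.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    values = line.split("\t")
    if len(values) != len(GFF_FIELDS):
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(values))
    entry = dict(zip(GFF_FIELDS, values))
    entry["start"] = int(entry["start"])
    entry["end"] = int(entry["end"])
    want_all = any(a in ("*", "all") for a in attributes)
    for pair in entry["attributes"].split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed X=Y elements
        key, value = pair.split("=", 1)
        key = key.strip()
        if key in GFF_FIELDS:
            continue  # assumption: never overwrite a default field
        if want_all or key in attributes:
            entry[key] = value.strip()
    if not attfield:
        del entry["attributes"]
    return entry
```

Calling parse_gff_line(line, attributes=["ID"], attfield=False) would keep the eight positional fields plus ID and drop the raw attributes column.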

integrity=T will perform checks that the features do not go outside the range of the parsed sequence-region and/or fasta sequences.
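A minimal sketch of such a range check is given below, assuming features are dicts with locus, start and end keys and that region limits come from GFF3 "##sequence-region seqid start end" comments. The function names are hypothetical, and the checks against fasta sequence lengths that the module can also perform are not shown.

```python
def parse_sequence_regions(lines):
    """Collect {locus: (start, end)} from '##sequence-region' comments."""
    regions = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[0] == "##sequence-region":
            regions[parts[1]] = (int(parts[2]), int(parts[3]))
    return regions

def out_of_range(features, regions):
    """Return the features whose coordinates fall outside their region."""
    bad = []
    for feat in features:
        if feat["locus"] not in regions:
            continue  # no region parsed for this locus: nothing to check
        region_start, region_end = regions[feat["locus"]]
        if feat["start"] < region_start or feat["end"] > region_end:
            bad.append(feat)
    return bad
```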

indelwarn=T and stopwarn=T identify adjacent CDS features that may result from sequencing and/or translation errors. indelwarn=T looks for adjacent CDS features that share the same "product" annotation (the field is set by warnfield=X), or whose annotation is listed in hyplist=LIST, and that lie within 3 nt of each other (generally overlapping); such pairs might represent a single ORF fragmented by a frameshift error. stopwarn=T identifies similar features that have exactly one codon between them, which could represent a codon from an atypical genetic code being mis-translated as a stop codon.
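The pairwise test behind these warnings might look something like the sketch below. It is one interpretation of the description, not the module's algorithm, and several points are assumptions: the gap is taken as next start minus previous end minus 1, a gap of exactly 3 nt is a stop warning, any other gap from -3 to 2 nt is an indel warning, both features must be on the same locus and strand, and hypindel caps how many of the two features may carry a hyplist annotation. The function name classify_pair is hypothetical.

```python
def classify_pair(prev, nxt, warnfield="product",
                  hyplist=("hypothetical protein",), hypindel=1):
    """Classify two adjacent CDS features as 'stopwarn', 'indelwarn' or None.

    prev and nxt are dicts with locus, strand, start, end and the
    warnfield key, sorted so that prev starts before nxt.
    """
    if prev["locus"] != nxt["locus"] or prev["strand"] != nxt["strand"]:
        return None
    a, b = prev.get(warnfield), nxt.get(warnfield)
    if a is None or b is None:
        return None  # no annotation to compare
    n_hyp = sum(1 for value in (a, b) if value in hyplist)
    if n_hyp > hypindel:
        return None  # too many hypothetical proteins in the pair
    if a != b and n_hyp == 0:
        return None  # two different named products
    gap = nxt["start"] - prev["end"] - 1  # nt between features; < 0 is overlap
    if gap == 3:
        return "stopwarn"   # exactly one codon separates the features
    if abs(gap) <= 3:
        return "indelwarn"  # close or overlapping: possible frameshift
    return None
```

With these assumptions, two "kinase" CDS features at 1..300 and 304..600 on the same strand would be flagged as stopwarn, while the same pair at 1..300 and 299..600 would be flagged as indelwarn.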

joinseq=T will output joined sequences to *.joined.gff and, if sequences are parsed, *.joined.aa.fas and *.joined.nt.fas. For protein sequence translations, stopwarn sequences are joined with a *. indelwarn sequences are joined with flanking and internal xx pairs that delineate the overlapping parts of each annotated protein sequence.

NOTE: Only GFF3 is currently supported.

Commandline

Input/Output Options

gffin=FILE : Input GFF file to parse [None]
seqin=FILE : Optional fasta file of reference sequences [None]
gfftab=T/F : Whether to output parsed GFF file as a delimited table with headers [True]
gffloci=T/F : Whether to parse sequence-region GFF comments to *.loci.tdt [True]
gffcomment=T/F : Whether to output parsed GFF comments to *.comments.txt [False]
gfffasta=T/F : Whether to output parsed GFF sequences to *.fasta [False]
attributes=LIST : List of attributes (X=Y;) to pull out into own fields ("*" or "all" for all) [*]
attfield=T/F : Whether to keep the full attribute field as parsed from the GFF file [False]
gffout=FILE : Save updated GFF format to FILE [None]
gffseq=T/F : Whether to include sequences in updated GFF file [False]

GFF Processing Options

integrity=T/F : Perform GFF integrity check based on parsed sequence-region comments and/or fasta [True]
indelwarn=T/F : Perform check for possible indels based on overlapping/close common features [True]
hypindel=INT : Number of hypothetical proteins that can be involved in a possible indel (0-2) [1]
stopwarn=T/F : Perform check for possible codon table stop codon errors based on close common features [True]
warnfield=X : Attribute field to use for generating indel or stop codon warnings [product]
idfield=X : Attribute field to use for CDS gene ID [ID]
hyplist=LIST : List of warnfield values to identify as hypothetical protein ['hypothetical protein']
cdsfeatures=LIST : List of feature types to count as CDS for warning checks [CDS]
joinseq=T/F : Whether to join sequences possibly affected by stop codons or frameshifts [False]
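The options above all take the form option=value. As a hedged illustration of assembling such arguments from Python, the sketch below converts a dict of settings into argument strings. The helper build_args is hypothetical, the script name rje_gff.py and its location are assumptions, and joining LIST values with commas is assumed to match the general rje command conventions.

```python
def build_args(options):
    """Turn {option: value} into a list of 'option=value' strings.

    Booleans become T/F and lists are comma-joined (assumed convention).
    """
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        args.append("%s=%s" % (key, value))
    return args

# Hypothetical invocation; adjust the script path to the local install.
command = ["python", "rje_gff.py"] + build_args({
    "gffin": "genome.gff",
    "seqin": "genome.fasta",
    "attributes": ["ID", "product"],
    "joinseq": True,
})
```

The resulting list could be passed to subprocess.run(command) once the script path is confirmed.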

History

Module Version History

    # 0.0.0 - Initial Compilation.
    # 0.1.0 - Basic functional version.
    # 0.1.1 - Modified for splice isoform handling.
    # 0.1.2 - Fixed parsing of GFFs with sequence-region information interspersed with features.
    # 0.1.3 - Added option to parseGFF to switch off the attribute parsing.
    # 0.2.0 - Added GFF output with the ability to fix tab-delimiter errors in the GFF.
    # 0.2.1 - Added restricted feature parsing from GFF.

This server is still in development. Please report any odd/unwanted behaviour.

Run

Upload GFF file and set options below, then click:

After running, click on the features tab to see the main table of GFF features.

GFF Input Options:

GFF file upload:

Optional reference fasta file upload:

Output Options

Features table | Loci (sequence-region) table | GFF comments | GFF sequence FASTA
Join sequences possibly affected by stop codons or frameshifts

List of attributes (X=Y;) to pull out into own fields ("*" or "all" for all):
*
Whether to keep the full attribute field as parsed from the GFF file

Processing Options

Perform GFF integrity check based on parsed sequence-region comments and/or fasta
Perform check for possible indels based on overlapping/close common features
Number of hypothetical proteins that can be involved in a possible indel (0-2):
Perform check for possible codon table stop codon errors based on close common features

Attribute field to use for generating indel or stop codon warnings:

Attribute field to use for CDS gene ID:

List of warnfield values to identify as hypothetical protein:
hypothetical protein
List of feature types to count as CDS for warning checks:
CDS

Advanced Options

Other options:

SLiMSuite REST Server
