Edena v2.x for Windows: Simple Tutorial
(This Windows version of Edena is experimental)

OVERVIEW:

Edena is a software application dedicated to the de novo assembly of
very short reads of the same length (such as those produced by the
Illumina Genome Analyzer). For the licensing terms, see the file
"LICENCE.TXT" included in the distribution.

REFERENCE:

D. Hernandez, P. Francois, L. Farinelli, M. Osteras, and J. Schrenzel.
De novo bacterial genome sequencing: millions of very short reads 
assembled on a desktop computer.
Genome Research. In press.

Invoking Edena with no argument provides the program usage. Edena
separates the overlapping and the assembling operations into two
distinct running modes.

1) The overlapping mode is invoked by providing the program with a
FASTA or FASTQ short reads dataset (-r option). If a FASTQ file is
given, Edena ignores the quality information. The overlapping mode
computes all overlaps of at least 20 bases (the default minimum
length). It builds a transitively reduced overlap graph and saves the
result in a binary file suffixed with ".ovl".

2) The assembling mode is invoked by providing the program with a .ovl
file (-e option). This mode performs the assembly and outputs three
files. The one suffixed with ".fasta" contains the contigs, the one
suffixed with ".cov" provides the coverage depth of each base in the
contigs, and the one suffixed with ".info" provides a summary of the
parameters that were used and some information about the assembly
run.
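
To make the notion of an overlap with a minimum length concrete, the
following Python sketch finds the longest suffix of one read that
matches a prefix of another. It is an illustration only: the function
name longest_overlap and the toy reads are hypothetical,
reverse-complement overlaps are ignored, and the brute-force comparison
is not meant to reflect how Edena computes overlaps internally.

```python
def longest_overlap(a, b, min_len=20):
    """Return the length of the longest proper suffix of `a` that equals a
    prefix of `b`, or 0 if no such overlap reaches `min_len` bases."""
    longest_possible = min(len(a), len(b)) - 1
    # Try the longest candidate first so the first match is the longest one.
    for k in range(longest_possible, max(min_len, 1) - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


# Two 35-base toy reads sharing a 24-base region (made-up sequences).
core = "ACGTTGCA" * 3
read_a = "T" * 11 + core
read_b = core + "G" * 11

print(longest_overlap(read_a, read_b, min_len=20))
print(longest_overlap(read_a, read_b, min_len=26))
```

With a cutoff of 20 the 24-base overlap is reported. With a cutoff of
26 it is discarded, which is the kind of filtering a minimum overlap
length applies.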

SIMPLE TUTORIAL:

-Edena must be run from a command-line window-

The "data" directory contains a simulated read dataset that was
obtained by randomly sampling 160,000 reads of 35 bases from the
83,319 bp Thermus Phage P74-27 genome. A base error rate of 1% was
introduced uniformly into the read sequences.
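
As a rough check of what these numbers imply, the Python sketch below
computes the expected coverage depth of the dataset and shows one
simple way such reads could be simulated. The function names are
hypothetical, and the simulation makes assumptions that the tutorial
does not state: reads are taken from the forward strand only, and
errors are substitutions only. It is not the procedure that was used
to build the distributed dataset.

```python
import random


def expected_coverage(n_reads, read_len, genome_len):
    """Average number of reads covering each base of the genome."""
    return n_reads * read_len / genome_len


def simulate_reads(genome, n_reads, read_len, error_rate, seed=0):
    """Sample reads at uniformly random positions and substitute each base
    with probability `error_rate` (assumed error model: substitutions only)."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with one of the three other nucleotides.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


# Coverage implied by the tutorial dataset: 160,000 reads x 35 bases / 83,319 bp.
print(round(expected_coverage(160000, 35, 83319), 1))

# A small made-up genome, only to exercise the sampler.
toy_rng = random.Random(42)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(1000))
toy_reads = simulate_reads(toy_genome, n_reads=200, read_len=35, error_rate=0.01)
```

The tutorial dataset therefore corresponds to a coverage depth of
about 67x.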

The program is first invoked for the overlapping mode by typing:

edena.exe -r data/thermus.reads -p thermus

This command computes all overlaps of at least 20 bases and builds
the transitively reduced overlap graph. Once it has finished, a file
named "thermus.ovl" is created. This file encodes all the
information required for the assembling step.
 
The assembling mode is invoked with the '-e' option which requires a
.ovl file:

edena.exe -e thermus.ovl -p myThermusAssembly

This command assembles the genome into 5 contigs. By default, the
assembling mode considers only overlaps of at least 22 bases. With
this setting, the program reports that all contig elongations were
interrupted by branchings (ambiguities). Increasing the required
overlap length (-m option) resolves more ambiguities. In this case, a
single perfect contig can be obtained by increasing this value to 26:

edena.exe -e thermus.ovl -m 26 -p Ov26MyThermusAssembly

With an even higher value, contig elongations are interrupted because
of a lack of overlapping reads (insufficient coverage):

edena.exe -e thermus.ovl -m 30 -p Ov30MyThermusAssembly

This command assembles the genome into 40 contigs whose ends end in a
gap (i.e. they could not be elongated for lack of an overlapping
read). To summarize, the required overlap length (-m) is a decisive
parameter, and its optimal setting depends strongly on the data you
are assembling, particularly on the coverage depth that was achieved
by the sequencing process. You should try different values of this
parameter in the assembling mode. Increasing the overlap length
cutoff resolves more ambiguities, but in turn requires a higher
coverage depth. As a general rule, you should always use the highest
value allowed by the coverage depth.
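
Trying several -m values by hand can be tedious, so the search can be
scripted. The Python sketch below builds the same command lines as in
this tutorial for a range of cutoffs and summarizes the contigs of
each run. The function names and the "Ov<m>" prefixes are
hypothetical, and the script assumes that the contigs are written to
"<prefix>.fasta" in standard FASTA format (header lines starting with
">"). N50 is used here as one common summary statistic; it is not
something Edena is documented to report.

```python
import subprocess


def build_assembly_command(ovl_file, overlap_cutoff, prefix, exe="edena.exe"):
    """Command line for one assembling-mode run, using the -e, -m and -p options."""
    return [exe, "-e", ovl_file, "-m", str(overlap_cutoff), "-p", prefix]


def contig_lengths(fasta_text):
    """Length of every sequence in a FASTA-formatted string."""
    lengths = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # a header starts a new contig
        elif line and lengths:
            lengths[-1] += len(line)   # sequence lines may be wrapped
    return lengths


def n50(lengths):
    """Smallest contig length such that contigs at least that long
    contain at least half of all assembled bases (0 for no contigs)."""
    running, total = 0, sum(lengths)
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def run_sweep(ovl_file, cutoffs, exe="edena.exe"):
    """Run one assembly per cutoff; return {cutoff: (contig count, N50)}."""
    results = {}
    for m in cutoffs:
        prefix = "Ov%d" % m  # hypothetical naming scheme
        subprocess.run(build_assembly_command(ovl_file, m, prefix, exe), check=True)
        # Assumption: the contigs are written to "<prefix>.fasta".
        with open(prefix + ".fasta") as handle:
            lengths = contig_lengths(handle.read())
        results[m] = (len(lengths), n50(lengths))
    return results
```

For example, calling run_sweep("thermus.ovl", range(22, 31, 2)) from
the directory that contains edena.exe and thermus.ovl would run five
assemblies (-m 22, 24, 26, 28 and 30) and return the number of contigs
and the N50 for each, which makes it easier to spot the cutoff that
gives the fewest, longest contigs.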


PROGRAM OPTIONS:
  1) Overlapping mode:
    -r
   --readsFile [string]     Reads file in FASTA or FASTQ format.
    -p
   --prefix [string]        Prefix for the output files. Default is "out".
    -M
   --minOverlap [int]       Minimum size of the overlaps to compute.
                            Default is 20.
    -t
   --truncate [int]         Discard n bases from the right end of the reads.

  2) Assembling mode:
    -e
   --edenaFile [string]     Edena overlap (.ovl) file.
    -p
   --prefix [string]        Prefix for the output files.
    -m
   --overlapCutoff [int]    Only consider overlaps >= the specified size.
                            The default setting is 22. The optimal setting of
                            this parameter strongly depends on the coverage
                            that was achieved by the sequencing run. You should
                            therefore try different values in order to find the
                            optimal one.
    -c
   --minContigSize [int]    Minimum size of the contigs to output.
                            Default is 100.
                            (Very short contigs are more likely to contain
                            base errors at their ends.)
    -s
   --strict [boolean]       1: do not assemble ambiguities (default).
                            0: allow the program to make aggressive choices,
                            at the risk of mis-assemblies!
    -d
   --depthLimit [int]       Minimum depth for a path to be valid. Default is 10.
                            Changing this value is not recommended.
   --trim [int]             Coverage cutoff for contigs ending in gaps.
                            Contig ends that end in a gap may contain errors
                            due to low coverage. This option trims bases from
                            these ends until a minimum coverage is reached.
                            Default is 4. Ends are not trimmed if set to 1.

REPORT BUGS:
   david.hernandez@genomic.ch
