1. Introduction to NGS-FC

1.1 What can you do with NGS-FC

Owing to the rapid development of next-generation sequencing (NGS), a variety of new databases have been built to store the growing volume of data, and a large number of tools have been designed to analyze these data. However, most of these tools define their own formats for data representation and storage. This diversity of formats may be attributable to differences in instruments or vendors, sequencing principles, and development background or aims (such as readability, integration, or space saving), among other factors.

To make full use of NGS data, many analysis tools that integrate format conversion functions have been developed in recent years, such as Samtools, NGSUtils, and Picard. However, each of these tools supports only some formats or is not cross-platform, and there is not yet a generic framework for format conversion in the NGS field. Systems biology research faced the same problem, and the Systems Biology Format Converter (SBFC) was developed to provide a generic framework for format conversion there. We therefore set out to design a similar format converter for the NGS field.

We developed open-source software named NGS-FC (Next Generation Sequencing Format Converter), a framework that supports conversion between different NGS formats. It is easy to use and spares users from troublesome manual format conversion. It is written in the platform-independent language Java, currently supports 14 formats, and can easily be extended.

We integrated some well-known conversion scripts and also developed new ones, and now 14 formats are supported, as shown in Figure 1.

Fig. 1. Format conversion supported by NGS-FC.
Blue ellipses denote sequence or quality score formats, yellow ellipses denote alignment formats,
and green ellipses denote sequence annotation and visualization formats.

In brief, NGS-FC has five main functions:
1. Format conversion (calling built-in scripts and external scripts)
2. Support for both GUI and command line
3. Support for user-added converters
4. Batch processing
5. Platform-independent

1.2 System requirements and installation
1.2.1 Computer requirements
To run the tool smoothly, we recommend a computer with more than 4 GB of RAM and more than 120 GB of hard disk space.
- Mac users: go to "About this Mac" and check Memory and Storage.
- Windows users: go to "Control Panel->System" to see the installed memory (RAM), and to "This PC" to see the "Local Disk" properties.
- Linux users: run "free" or "cat /proc/meminfo" in the terminal to see the memory, and run "df -hl" to see the storage.

1.2.2 Environment set-up
- Install the Java SE Development Kit 1.6 or above (http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html).
- Software developers should also install NetBeans IDE 8.2 or above (https://netbeans.org/).
- Test that these installations work:
1) Mac: open a terminal and run the "java -version" command. If the output resembles the figure below, Java is installed successfully.

2) Windows: open a command prompt and run the "java -version" command. If the output resembles the figure below, Java is installed successfully.

3) Linux: open a terminal and run the "java -version" command. If the output resembles the figure below, Java is installed successfully.
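
The same check can also be made from inside a Java program. The following is a minimal sketch, not part of NGS-FC: the class name JavaVersionCheck and the helper featureVersion are hypothetical. It reads the standard "java.version" system property and compares it against the "1.6 or above" requirement stated above.

```java
// Hypothetical helper, not part of NGS-FC: reports whether the running
// JVM meets the "1.6 or above" requirement.
class JavaVersionCheck {

    // Extracts the feature version from a version string,
    // e.g. "1.6.0_45" gives 6 and "17.0.2" gives 17.
    static int featureVersion(String version) {
        String[] parts = version.split("[._\\-+]");
        int first = Integer.parseInt(parts[0]);
        // Releases before Java 9 are numbered 1.x, so the feature
        // version is the second field.
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        int feature = featureVersion(version);
        System.out.println("Java version: " + version);
        System.out.println(feature >= 6
                ? "Requirement met."
                : "Java 1.6 or above is required.");
    }
}
```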


About NetBeans IDE:
Once NetBeans IDE is installed, start it. If it runs, the installation was successful.
Figure 5 shows the NetBeans IDE interface.

1.2.3 NGS-FC installation
We have tested NGS-FC on Win7, Win10, Mac OS X version 10.9, and Ubuntu 4.8.4 (Linux).
1) Download NGS-FC(build).rar from Section 4.1 Source Code.
2) Extract NGS-FC(build).rar. The result looks like the figure below.
File "NGS-FC.jar" is the main program.
Folder "Scripts" is the outer script folder (see Section 3.2 Add external scripts for details).
Folder "TestData" contains several test data files (see Section 4.2 Test Data for details).

Start NGS-FC.jar. If the main interface appears as in Figure 2, the tool is installed successfully.


1.2.4 Installation test
After installation, we can test whether NGS-FC works properly.
Step 1: Select the Build-in Script tab.
Step 2: Select the input format "FASTA" in the list.
Step 3: Select the output format "FASTQ" in the list.
Step 4: Add a FASTA file from the TestData folder.
Step 5: Select the output directory.
Note: NGS-FC names the output file automatically. For more information, please refer to Appendix B: Output naming rules.
Step 6: Press the Start button.
Step 7: When "System Message" shows the conversion finished message, check the output file.
Step 8: If the output file is correct, the installation was successful.

2. Quick Start with NGS-FC
2.1 Use NGS-FC from the command line
2.1.1 Synopsis
(1) The general syntax is:
java -jar NGS-FC.jar -i <input> -o <output> -t <type> [arguments]
The arguments may appear in any order, but each flag must be followed by its value.

(2) Use a config file to set the arguments.
java -jar NGS-FC.jar -c <config file>
The config file contains all the configuration information.
For example, the following config converts BAM to FASTQ:
inputs= #left empty when calling an external script
output= #left empty when calling an external script
script=samtools fastq #the external conversion tool
args=in.bam #the conversion arguments
type=-bm2fq #the conversion type

(3) Show help.
java -jar NGS-FC.jar -h

(4) Use built-in scripts to convert FASTQ to FASTA.
java -jar NGS-FC.jar -i data.fastq -o <dir> -t -fq2fa
Conversion types include:
-fa2fq FASTA to FASTQ
-fa2fa CSFASTA to FASTA
-sm2fa SAM to FASTA
-sm2fq SAM to FASTQ
etc.
For more information on the conversion types, please see Appendix A.

(5) Use external scripts to convert SAM to FASTQ.
java -jar NGS-FC.jar -s picard.jar SamToFastq I=<dir>\data.sam FASTQ=<dir>\output.fastq SECOND_END_FASTQ=<dir>\output_pairEnd.fastq

2.1.2 Commands and options

-i
Input file.

-o
Output directory.

-s
Script file, required when using an external script.

-t
Conversion type.

-c
Config file.

Config file example:
inputs=data.fa #the input files (used by built-in scripts); separate multiple files with ","
output=/home/houhuabin/yu/Publish20161208/out #the output directory (used by built-in scripts)
script = /home/houhuabin/yu/Publish20161208/faToTwoBit #path of the script file, if an external script is used
type = -fa22bit #conversion type; for more information, please refer to Appendix A
args = data.fa hs.2bit #arguments for the conversion, separated by spaces

-h
Show help.
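
Because the options above come as flag/value pairs in any order, they can be collected into a map before a conversion is dispatched. The following is only a sketch of that idea and is not taken from the NGS-FC source: the class ArgPairs is hypothetical, it covers the paired options only, and -h (which takes no value) would have to be handled before this step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the NGS-FC source: collects flag/value
// pairs such as "-i data.fastq -o out -t -fq2fa" in any order.
class ArgPairs {

    static Map<String, String> parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException(
                    "Each flag must be followed by a value.");
        }
        Map<String, String> options = new LinkedHashMap<String, String>();
        for (int i = 0; i < args.length; i += 2) {
            String flag = args[i];
            if (!flag.startsWith("-")) {
                throw new IllegalArgumentException(
                        "Expected a flag but found: " + flag);
            }
            // The value may itself start with "-", as in "-t -fq2fa".
            options.put(flag, args[i + 1]);
        }
        return options;
    }

    public static void main(String[] args) {
        String[] example = {"-t", "-fq2fa", "-o", "out", "-i", "SRR001030.fastq"};
        System.out.println(parse(example));
    }
}
```

Reading the arguments two at a time is what lets a conversion type such as -fq2fa be taken as a value rather than mistaken for another flag.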

2.1.3 Command Example

2.1.3.1 Use built-in scripts to convert FASTQ to FASTA
java -jar NGS-FC.jar -i SRR001030.fastq -o out -t -fq2fa
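
To illustrate what a FASTQ to FASTA conversion involves, here is a minimal sketch. It is not the NGS-FC implementation: the class FastqToFastaSketch is hypothetical, and it assumes four-line FASTQ records (header, sequence, "+" line, quality) without line wrapping.

```java
// Hypothetical sketch, not the NGS-FC implementation. Assumes each
// FASTQ record is exactly four lines with no wrapped sequence lines.
class FastqToFastaSketch {

    static String convert(String fastqText) {
        String[] lines = fastqText.split("\r?\n");
        StringBuilder fasta = new StringBuilder();
        int i = 0;
        while (i < lines.length) {
            if (lines[i].isEmpty()) {
                i++; // tolerate blank lines between records
                continue;
            }
            if (i + 3 >= lines.length) {
                throw new IllegalArgumentException(
                        "Truncated record at line " + (i + 1));
            }
            String header = lines[i];
            String sequence = lines[i + 1];
            String separator = lines[i + 2];
            String quality = lines[i + 3];
            if (!header.startsWith("@") || !separator.startsWith("+")
                    || sequence.length() != quality.length()) {
                throw new IllegalArgumentException(
                        "Malformed record at line " + (i + 1));
            }
            // FASTA keeps the identifier and the bases; quality is dropped.
            fasta.append('>').append(header.substring(1)).append('\n');
            fasta.append(sequence).append('\n');
            i += 4;
        }
        return fasta.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert("@read1\nACGT\n+\nIIII\n"));
    }
}
```

A converter meant for files the size of SRR001030.fastq would read and write record by record instead of holding the whole text in memory; the string-based form is used here only to keep the example short.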

2.1.3.2 Use external scripts to convert FASTA to 2Bit
java -jar NGS-FC.jar -c config.ini

The content of config.ini is:
inputs=
output=
script = /home/houhuabin/yu/Publish20161208/faToTwoBit
args = data.fa hs.2bit
type = -fa22bit
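
A config file of this shape (key=value lines, optional spaces around "=", and optional trailing "#" comments) can be read with a few lines of Java. The sketch below is an assumption about how such a file could be parsed, not the NGS-FC parser: the class ConfigSketch is hypothetical, and it assumes that "#" never occurs inside a value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the NGS-FC parser. Assumes "#" always
// starts a comment and never appears inside a value.
class ConfigSketch {

    static Map<String, String> parse(String text) {
        Map<String, String> config = new LinkedHashMap<String, String>();
        for (String line : text.split("\r?\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the trailing comment
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a key=value line
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            config.put(key, value);
        }
        return config;
    }

    // Splits the comma-separated "inputs" entry; empty when unset.
    static String[] inputFiles(Map<String, String> config) {
        String inputs = config.get("inputs");
        if (inputs == null || inputs.isEmpty()) {
            return new String[0];
        }
        return inputs.split(",");
    }

    public static void main(String[] args) {
        String text = "inputs=\noutput=\n"
                + "script = /home/houhuabin/yu/Publish20161208/faToTwoBit\n"
                + "args = data.fa hs.2bit\n"
                + "type = -fa22bit\n";
        System.out.println(parse(text));
    }
}
```

Note that java.util.Properties treats "#" as a comment marker only at the start of a line, so it would keep a trailing comment as part of the value; this is why the sketch strips comments by hand.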

2.2 Use the GUI of NGS-FC
2.2.1 Overview

The NGS-FC GUI has 8 parts, as shown in Figure 2.
(1) The Menu contains File, Convert, Tools and Help. It can be used to add files, start a conversion, etc.
(2) The Tool Bar contains Add, AddAll, Remove, RemoveAll, Output and Start. It can be used to add files, remove files, start a conversion, etc.
(3) The File Selector shows the directory from which files can be selected. Right-click to add a file through the popup menu.
(4) The Conversion Type and Parameters panel contains the conversion type and parameter settings. Either built-in scripts or outer scripts can be selected here.
(5) The Conversion Files panel lists the files that have been added.
(6) The Output Directory is the directory to which output files are written.
(7) The System Messages panel shows action messages.
(8) The System State panel shows the system state. A progress bar is shown while files are being converted.

Fig. 2. Overview of NGS-FC.

2.2.2 Conversion by built-in scripts

Figure 3 shows the interface for built-in scripts.

Fig. 3. Interface for built-in scripts.

Step 1: Select the Build-in Script tab.
Step 2: Select the input format in the list.
Step 3: Select the output format in the list.
Step 4: Add the files you want to process.
Step 5: Select the output directory.
Note: NGS-FC names the output file automatically. For more information, please refer to Appendix B: Output naming rules.
Step 6: Enter parameters if required.
Step 7: Press the Start button and have a coffee; the conversion will be done soon.

2.2.3 Use external scripts
Figure 4 shows the interface for outer scripts.

Fig. 4. Interface for outer scripts.

If NGS-FC does not have the conversion script you want, or you want to execute another program (for example, an analysis program), follow the steps below.
Step 1: Select the Outer Scripts tab.
Step 2: Select the script or program in the list. (NGS-FC loads all the scripts (pl, py, exe, jar...) in the scripts directory.)
Note: To execute a Perl script, you must have a Perl interpreter installed. The same applies to other script types.
Step 3: Enter parameters.
Step 4: Press the Start button and have a coffee; the conversion will be done soon.

3. Developer Guide
3.1 Add or change built-in scripts
NGS-FC is written in Java. We used NetBeans IDE 8.1 to edit our project, so users who want to change our code need to install NetBeans IDE first. Table 1 lists the packages and classes we designed for NGS-FC based on OOP.

Table 1. NGS-FC packages and classes
Package | Class | Description
cn.edu.suda.converter | Converter, BedToBedgraph, BowtieToSam, SamToFastq, SamToFasta, FastaToTwoBit, TwoBitToFasta | Format converter classes. The abstract class Converter is the base class of all other conversion classes.
cn.edu.suda.database | DataBase.java, DataBaseLoader.java, DataSearch.java | Database search classes.
cn.edu.suda.format | NGSFormat.java, BedFormat.java, BedGraphFormat.java | Format classes. Class NGSFormat is the base class of all other format classes.
cn.edu.suda.gui | MainFrame.java, DatabaseFrame.java, DataBaseList.java | GUI classes.
cn.edu.suda.io | FormatReader.java, FormatWriter.java, BedReader.java, BowtieReader.java | IO classes. The abstract class FormatReader is the base class of all other format reader classes. The abstract class FormatWriter is the base class of all other format writer classes.
cn.edu.suda.log | DocumentHandler.java, RecordFormatter.java | Logging classes.
cn.edu.suda.manager | Manager.java, RunThread.java, RunScript.java | Manager classes. Class RunThread calls the built-in conversion classes. Class RunScript calls the external scripts.
cn.edu.suda.util | ByteUtil.java, ColorSpaceConverter.java, FileUtil.java, QualityEncoding.java | Utility classes. Class QualityEncoding converts quality values among Sanger, Solexa and Illumina.
ngsformatconverter | Launcher | Entry point of the program.
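
As an illustration of what a quality conversion utility such as QualityEncoding in Table 1 has to do, the sketch below converts quality strings to the Sanger encoding. It is not the NGS-FC code: the class and method names are hypothetical, and it assumes the commonly documented conventions (Sanger stores Phred scores with ASCII offset 33, Illumina 1.3+ stores Phred scores with offset 64, and Solexa stores odds-based scores with offset 64).

```java
// Hypothetical sketch, not the NGS-FC QualityEncoding class. Assumes
// the commonly documented offsets: Sanger Phred+33, Illumina 1.3+
// Phred+64, Solexa (odds-based score)+64.
class QualitySketch {

    static final int SANGER_OFFSET = 33;
    static final int ILLUMINA_OFFSET = 64;
    static final int SOLEXA_OFFSET = 64;

    // Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), rounded.
    static int solexaToPhred(int solexa) {
        double phred = 10.0 * Math.log10(Math.pow(10.0, solexa / 10.0) + 1.0);
        return (int) Math.round(phred);
    }

    // Illumina 1.3+ to Sanger is a pure offset shift.
    static String illuminaToSanger(String quality) {
        StringBuilder out = new StringBuilder();
        for (char c : quality.toCharArray()) {
            int phred = c - ILLUMINA_OFFSET;
            out.append((char) (phred + SANGER_OFFSET));
        }
        return out.toString();
    }

    // Solexa to Sanger needs the score conversion as well as the shift.
    static String solexaToSanger(String quality) {
        StringBuilder out = new StringBuilder();
        for (char c : quality.toCharArray()) {
            int phred = solexaToPhred(c - SOLEXA_OFFSET);
            out.append((char) (phred + SANGER_OFFSET));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(illuminaToSanger("hhhh"));
        System.out.println(solexaToSanger("hhhh"));
    }
}
```

The two score scales agree closely for high-quality bases and differ mainly at the low end, which is why Solexa data cannot be converted by shifting the offset alone.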


3.1.1 Change existing conversion code
If users are not satisfied with our converters, they can change the code.
The format conversion code is in package cn.edu.suda.converter. The format definition classes are in package cn.edu.suda.format.
The format reader and writer classes are in package cn.edu.suda.io.
For example, to change the FASTA to 2Bit conversion, users need to modify the cn.edu.suda.converter.FastaToTwoBit class, cn.edu.suda.format.TwoBitFormat, cn.edu.suda.format.TwoBitIndex, cn.edu.suda.format.TwoBitSequence, cn.edu.suda.io.TwoBitReader, and any other classes used by these classes.
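
For orientation when working on the 2Bit classes, the core of the format is the packing of four bases into one byte. The sketch below follows the published UCSC 2bit convention (T=00, C=01, A=10, G=11, first base in the most significant bits). It is not the FastaToTwoBit code, whose internals are not shown in this manual; the class TwoBitPackSketch is hypothetical, and the header, index, N blocks and mask blocks of a real .2bit file are left out.

```java
// Hypothetical sketch of 2bit base packing, not the NGS-FC code.
// Follows the UCSC convention: T=00, C=01, A=10, G=11.
class TwoBitPackSketch {

    static int code(char base) {
        switch (Character.toUpperCase(base)) {
            case 'T': return 0;
            case 'C': return 1;
            case 'A': return 2;
            case 'G': return 3;
            default:
                // A real 2bit writer records N runs in separate blocks.
                throw new IllegalArgumentException("Unsupported base: " + base);
        }
    }

    // Packs four bases per byte, first base in the highest two bits.
    static byte[] pack(String dna) {
        byte[] packed = new byte[(dna.length() + 3) / 4];
        for (int i = 0; i < dna.length(); i++) {
            int shift = 6 - 2 * (i % 4);
            packed[i / 4] |= code(dna.charAt(i)) << shift;
        }
        return packed;
    }

    public static void main(String[] args) {
        byte[] packed = pack("TCAG");
        System.out.println(Integer.toBinaryString(packed[0] & 0xFF));
    }
}
```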

3.1.2 Add new conversion code
If a new format needs to be integrated into NGS-FC, users need to add new conversion code.
The steps are:
Step 1: Change the method init in cn.edu.suda.gui.MainFrame.java to add the conversion type.
Step 2: Change the method getConverter in cn.edu.suda.converter.Converter.
Step 3: Add a new format class in package cn.edu.suda.format to define the new format.
Note:
The new format class must inherit from the cn.edu.suda.format.NGSFormat class.
Step 4: Add new format file reader and writer classes in package cn.edu.suda.io.
Note:
The reader class must inherit from the cn.edu.suda.io.FormatReader class.
The writer class must inherit from the cn.edu.suda.io.FormatWriter class.
Step 5: Add a new converter class in package cn.edu.suda.converter.
Note:
The new class must inherit from the cn.edu.suda.converter.Converter class.
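
Steps 3 to 5 can be sketched as follows. This manual does not show the method signatures of NGSFormat, FormatReader, FormatWriter and Converter, so the sketch uses stand-in base classes (the names ending in "Stub") whose methods are assumptions, and a made-up tab-separated input format (name, tab, sequence) that is converted to FASTA. Treat it as an outline of the class layout, not as code that plugs into NGS-FC unchanged; Steps 1 and 2 depend on the real source and are not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical outline of Steps 3 to 5. The *Stub classes stand in
// for the real NGS-FC base classes, whose signatures may differ.
class NewConverterSketch {

    abstract static class NGSFormatStub {
        abstract String suffix();
    }

    abstract static class FormatReaderStub<T extends NGSFormatStub> {
        abstract List<T> read(String text);
    }

    abstract static class FormatWriterStub<T extends NGSFormatStub> {
        abstract String write(List<T> records);
    }

    abstract static class ConverterStub {
        abstract String convert(String input);
    }

    // Step 3: one record of the made-up format (name <TAB> sequence).
    static class TabSeqFormat extends NGSFormatStub {
        final String name;
        final String sequence;

        TabSeqFormat(String name, String sequence) {
            this.name = name;
            this.sequence = sequence;
        }

        String suffix() {
            return ".tsv";
        }
    }

    // Step 4: reader for the new format.
    static class TabSeqReader extends FormatReaderStub<TabSeqFormat> {
        List<TabSeqFormat> read(String text) {
            List<TabSeqFormat> records = new ArrayList<TabSeqFormat>();
            for (String line : text.split("\r?\n")) {
                String[] fields = line.split("\t");
                if (fields.length == 2) {
                    records.add(new TabSeqFormat(fields[0], fields[1]));
                }
            }
            return records;
        }
    }

    // Step 4: writer that emits the records as FASTA.
    static class TabSeqFastaWriter extends FormatWriterStub<TabSeqFormat> {
        String write(List<TabSeqFormat> records) {
            StringBuilder out = new StringBuilder();
            for (TabSeqFormat record : records) {
                out.append('>').append(record.name).append('\n');
                out.append(record.sequence).append('\n');
            }
            return out.toString();
        }
    }

    // Step 5: converter that chains the reader and the writer.
    static class TabSeqToFasta extends ConverterStub {
        String convert(String input) {
            return new TabSeqFastaWriter().write(new TabSeqReader().read(input));
        }
    }

    public static void main(String[] args) {
        System.out.print(new TabSeqToFasta().convert("seq1\tACGT\nseq2\tGGCC\n"));
    }
}
```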

3.2 Add external scripts
Users can copy external scripts to the "scripts" directory of NGS-FC or use the Add button in the Outer Scripts tab.
The Add button copies the external scripts to the "scripts" directory.
The runtime environment of an external script must be installed before the script is called.
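
The reason the runtime environment matters is that a script is started through its interpreter. The sketch below shows one way a launcher could build and run such a command with the standard ProcessBuilder class. It is not the NGS-FC RunScript class: the class ExternalScriptSketch, the mapping from file extension to interpreter, and the assumption that "perl", "python" and "java" are on the PATH are all illustrative.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the NGS-FC RunScript class. Assumes the
// interpreters "perl", "python" and "java" are available on the PATH.
class ExternalScriptSketch {

    // Prefixes the script with the interpreter implied by its extension.
    static List<String> buildCommand(String script, List<String> scriptArgs) {
        List<String> command = new ArrayList<String>();
        String lower = script.toLowerCase();
        if (lower.endsWith(".pl")) {
            command.add("perl");
        } else if (lower.endsWith(".py")) {
            command.add("python");
        } else if (lower.endsWith(".jar")) {
            command.add("java");
            command.add("-jar");
        }
        // Anything else is treated as a native executable.
        command.add(script);
        command.addAll(scriptArgs);
        return command;
    }

    // Runs the command and returns its exit code (needs Java 7+).
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> command = buildCommand(
                "scripts/faToTwoBit", Arrays.asList("data.fa", "hs.2bit"));
        // Printed only; call run(command) to actually start the program.
        System.out.println(command);
    }
}
```

If the interpreter is missing, ProcessBuilder.start() throws an IOException, which is the failure a user would see when the runtime environment has not been installed.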

4. Source and Test Data
4.1 Source Code
Download the source code and the executable file.
When the source code is extracted and imported into NetBeans IDE, the project appears as shown in Figure 5.

Fig. 5. The NetBeans IDE of NGS-FC

4.2 Test Data
4.2.1 Test data based on the format specification

The source code contains a TestData folder. The test data files are shown in Figure 6. The files are small, so they can be used to validate the program quickly.
Fig. 6. Test Data Files

The source code also contains a FormatSpecification folder. The format specification files are shown in Figure 7.
Fig. 7. Format Specification Files

4.2.2 Test data download from NCBI SRA
We downloaded five real datasets in different NGS formats from the NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra):
- FASTQ: SRR001030.fastq (130 MB)
- Qseq: SRR1145848.qseq (323 MB)
- BED: mapt.bed (16 MB)
- BAM: bwa.sort.bam (96 MB)
- FASTA: chr1.fasta (22 MB)

4.3 Test Results
We used these datasets for performance testing on Ubuntu 4.8.4. The results are shown in Table 2.

Table 2. Performance comparison between NGS-FC and other existing converters
Format Conversion | Dataset (Size: MB) | Software (Command) | RAM Usage (MB) | Running Time (s)
FASTQ to FASTA | SRR001030.fastq (130) | NGS-FC | 446 | 4
FASTQ to FASTA | SRR001030.fastq (130) | fastqutils (tofasta) | 30 | 13
Qseq to FASTQ | SRR1145848.qseq (323) | NGS-FC | 65 | 17
Qseq to FASTQ | SRR1145848.qseq (323) | fastqutils (fromqseq) | 6 | 44
BED to BedGraph | mapt.bed (16) | NGS-FC | 287 | 3
BED to BedGraph | mapt.bed (16) | bedutils (tobedgraph) | 498 | 13
BAM to FASTQ | bwa.sort.bam (96) | NGS-FC | 11 | 2
BAM to FASTQ | bwa.sort.bam (96) | bamutils (tofastq) | 74 | 34
FASTA to 2bit | chr1.fa (22) | NGS-FC | 719 | 12
FASTA to 2bit | chr1.fa (22) | BLAT (twoBitToFa) | 30 | 8

Appendix A: Conversion types
-fa2fq
FASTA to FASTQ.
The quality file must have the same file name as the FASTA file, with the suffix ".qual", and must be in the same directory as the FASTA file.

-fa2fa
CSFASTA to FASTA.

-fa22bit
FASTA to 2bit.

-fq2fa
FASTQ to FASTA.

-fq2fq
Change the FASTQ quality score encoding. The score codes are 0 for Sanger, 1 for Solexa, 2 for Illumina 1.3-1.4, and 3 for Illumina 1.5+.
The arguments are listed in Table 3.

Table 3. Arguments of FASTQ to FASTQ
Arguments | Description
Input score | Quality score encoding of the input file; required.
Output score | Quality score encoding of the output file; required.

-fq2sm
FASTQ to SAM. For more information, please refer to Picard (http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam).
The arguments are listed in Table 4.

Table 4. Arguments of FASTQ to SAM
Arguments | Description
FASTQ2 | Input FASTQ file (optionally gzipped) for the second read of paired-end data. Default value: null.
Quality format | Quality score format: sanger, solexa, Illumina 1.3+ or Illumina 1.5+. Default value: sanger.
Read group name | Read group name. Default value: A.
Sample name | Sample name to insert into the read group header. Default value: SampleName.
Library name | The library name to place into the LB attribute in the read group header. Default value: null.
Platform unit | The platform unit (often run_barcode.lane) to insert into the read group header. Default value: null.
Platform | The platform type (e.g. illumina, solid) to insert into the read group header. Default value: null.
Sequence center | The sequencing center from which the data originated. Default value: null.
Predicted insert size | Predicted median insert size, to insert into the read group header. Default value: null.
Description | Inserted into the read group header. Default value: null.

-qseq2fq
Qseq to FASTQ.

-qseq2fa
Qseq to FASTA.

-sc2fq
SCARF to FASTQ.

-sc2fa
SCARF to FASTA.

-2bit2fa
2bit to FASTA. The arguments are listed in Table 5.

Table 5. Arguments of 2bit to FASTA
Arguments | Description
In one file | Put all the sequences in one FASTA file or in separate files. Default value: false.

-sm2bm
SAM to BAM.

-sm2fa
SAM to FASTA. The arguments are listed in Table 6.

Table 6. Arguments of SAM to FASTA
Arguments | Description
Second end FASTA | Output fasta file (if paired, second end of the pair fasta). Default value: null. Cannot be used in conjunction with option(s) OUTPUT_PER_RG (OPRG).
Output per read group | Output a fastq file per read group (two fastq files per read group if the group is paired). Default value: false. Possible values: {true, false}. Cannot be used in conjunction with option(s) SECOND_END_FASTQ (F2) FASTQ (F).
Re-reverser | Re-reverse bases and qualities of reads with negative strand flag set before writing them to fastq. Default value: true. Possible values: {true, false}.
Include non-PF reads | Include non-PF reads from the SAM file into the output FASTQ files. Default value: false. Possible values: {true, false}.
Clipping attributes | The attribute that stores the position at which the SAM record should be clipped. Default value: null.
Clipping action | The action that should be taken with clipped reads: 'X' means the reads and qualities should be trimmed at the clipped position; 'N' means the bases should be changed to Ns in the clipped region; and any integer means that the base qualities should be set to that value in the clipped region. Default value: null.
Read1 trim | The number of bases to trim from the beginning of read 1. Default value: 0. This option can be set to 'null' to clear the default value.
Read1 max bases to write | The maximum number of bases to write from read 1 after trimming. If there are fewer than this many bases left after trimming, all will be written. If this value is null then all bases left after trimming will be written. Default value: null.
Read2 trim | The number of bases to trim from the beginning of read 2. Default value: 0. This option can be set to 'null' to clear the default value.
Read2 max bases to write | The maximum number of bases to write from read 2 after trimming. If there are fewer than this many bases left after trimming, all will be written. If this value is null then all bases left after trimming will be written. Default value: null.

-sm2fq
SAM to FASTQ. For more information, please refer to Picard (http://picard.sourceforge.net/command-line-overview.shtml#SamToFastq).

The arguments are the same as for -sm2fa.

-bm2sm
BAM to SAM.

-bm2fa
BAM to FASTA. The arguments are the same as for -sm2fa.

-bm2fq
BAM to FASTQ. The arguments are the same as for -sm2fq.

-psl2sm
PSL to SAM. The arguments are listed in Table 7.

Table 7. Arguments of PSL to SAM
Arguments | Description
Match | The score for a match. Default value: 1.
Mismatch | The score for a mismatch. Default value: 2.
Open | The score for a gap open. Default value: 5.
Extension | The score for a gap extension. Default value: 2.

-bw2sm
Bowtie to SAM.

-bd2bg
BED to BedGraph.

-wg2bd
Wig to BED.

-wg2bg
Wig to BedGraph.

Appendix B: Output naming rules

Output names follow these rules:
The output file has the same name as the input file, with the suffix of the output format.
If a file with that name already exists, "_1" is added to the end of the output name. The number is increased as long as the name still matches an existing file.
As a result, the output file never overwrites an existing file.
These rules apply only to the built-in scripts.
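
The naming rules above can be sketched as follows. This is an illustration and not the NGS-FC code: the class OutputNameSketch is hypothetical, the set of existing names stands in for a directory listing, and the sketch assumes that the number is placed before the format suffix (for example "SRR001030_1.fasta"), which the rules above do not state explicitly.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the naming rules, not the NGS-FC code.
// Assumes the counter goes before the suffix, e.g. "reads_1.fasta".
class OutputNameSketch {

    static String outputName(String inputName, String suffix, Set<String> existing) {
        int dot = inputName.lastIndexOf('.');
        String base = dot > 0 ? inputName.substring(0, dot) : inputName;
        String candidate = base + suffix;
        int counter = 1;
        // Keep increasing the number until the name is unused.
        while (existing.contains(candidate)) {
            candidate = base + "_" + counter + suffix;
            counter++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<String>();
        existing.add("SRR001030.fasta");
        System.out.println(outputName("SRR001030.fastq", ".fasta", existing));
    }
}
```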