Getting Started
===============
Shellflow is designed for rapid development of research workflows. If you
can write a bash script, you do not have to learn much new syntax. You only
have to add brackets that mark which files are inputs and which are outputs.

Before starting the tutorial
----------------------------

This tutorial requires the following software:

- bwa
- gatk4
- shellflow

The following data are also required:

- Reference genome (for example, hs37d5)
- BWA index of the reference genome (``bwa index hs37d5.fa``)
- Sequence dictionary file of the reference genome
  (``gatk --java-options "-Xmx4G" CreateSequenceDictionary -R hs37d5.fa``)
- Some sequence data (for example, DRR002191)
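
Before going further, it can help to confirm that the required commands are
installed. The following is a minimal sketch and not part of shellflow:
``missing_tools`` is a hypothetical helper name, and the check only looks up
each command on ``PATH`` with the shell built-in ``command -v``.

.. code:: bash

    # Print every command name from the arguments that is not found on PATH.
    missing_tools() {
        for tool in "$@"; do
            command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
        done
    }

    # Prints nothing when all three commands are available.
    missing_tools bwa gatk shellflow
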
1st step: mapping
-----------------

The syntax of a shellflow script is very similar to that of a bash shell
script. All you have to do is enclose input files in double parentheses
(``((`` and ``))``) and output files in double brackets (``[[`` and ``]]``).
You can use pipes and redirection in a workflow script, just as in an
ordinary shell script.

Content of ``gettingstarted.sf``:

.. code:: bash

    bwa mem -t 6 hs37d5.fa <(bzip2 -dc ((DRR002191_1.fastq.bz2))) <(bzip2 -dc ((DRR002191_2.fastq.bz2))) > [[DRR002191.sam]]

Run the workflow with ``shellflow run``:

.. code:: bash

    $ shellflow run gettingstarted.sf
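
To make the annotation rule concrete, the sketch below pulls the annotated
file names out of a workflow line with ``grep`` and ``sed``. It only
illustrates the markers and is not how shellflow itself parses scripts.
``list_inputs`` and ``list_outputs`` are hypothetical helper names, and the
patterns assume that file names contain no parentheses or brackets.

.. code:: bash

    # Print the file names wrapped in ((...)), one per line.
    list_inputs() {
        printf '%s\n' "$1" | grep -o '(([^()]*))' | sed 's/^((//; s/))$//'
    }

    # Print the file names wrapped in [[...]], one per line.
    list_outputs() {
        printf '%s\n' "$1" | grep -o '\[\[[^][]*]]' | sed 's/^\[\[//; s/]]$//'
    }

    line='bwa mem -t 6 hs37d5.fa <(bzip2 -dc ((DRR002191_1.fastq.bz2))) <(bzip2 -dc ((DRR002191_2.fastq.bz2))) > [[DRR002191.sam]]'
    list_inputs "$line"    # prints the two FASTQ file names, one per line
    list_outputs "$line"   # prints DRR002191.sam
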
2nd step: check status
----------------------

``shellflow viewlog`` lists the workflows that have been run:

.. code:: bash

    $ shellflow viewlog
    #| State|Success|Failed|Running|Pending|File Changed|Start Date |Name
    1| Done| 1| 0| 0| 0| Yes|2018/10/14 15:00:48|step1.sf

Pass a workflow number to see the details of its jobs:

.. code:: bash

    $ shellflow viewlog 1
    Workflow Script Path: /home/okamura/Documents/Programming/GO/workspace/src/github.com/informationsea/shellflow/examples/getting-started/step1/step1.sf
    Workflow Log Path: shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935
    Job Start: 2018/10/14 15:00:48
    Changed Input Files:
    ---- Job: 1 ------------
    State: JobDone
    Exit code: 0
    Reusable: No
    Script: bwa mem -t 6 hs37d5.fa DRR002191_1.fastq.bz2 DRR002191_2.fastq.bz2 > DRR002191.sam
    Input: DRR002191_1.fastq.bz2 DRR002191_2.fastq.bz2
    Output: DRR002191.sam
    Dependent Job IDs:
    Log directory: shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935/job001

The log directory of each job contains the following files:

.. code:: bash

    $ ls shellflow-wf/20181014-145901.507-step1.sf-1103fc92-e078-4e47-a316-62c4f16cb935/job001
    input.json local-run-pid.txt output.json rc run.sh run.stderr run.stdout script.sh script.stderr script.stdout
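
The job directories can also be scanned from the shell. The sketch below
lists the jobs of one workflow run that did not finish cleanly. It assumes
that the ``rc`` file holds the exit code of the job as a plain number. That
is an assumption based on the file name and on the ``Exit code`` field shown
by ``viewlog``, not on documented shellflow behaviour, and ``failed_jobs`` is
a hypothetical helper name.

.. code:: bash

    # Print each job directory under the given workflow log directory whose
    # rc file holds something other than 0.
    failed_jobs() {
        for rc_file in "$1"/job*/rc; do
            [ -f "$rc_file" ] || continue          # no job directories yet
            code="$(tr -d '[:space:]' < "$rc_file")"
            [ "$code" = "0" ] || printf '%s (exit code %s)\n' "$(dirname "$rc_file")" "$code"
        done
    }

Call it with a workflow log path, for example the ``Workflow Log Path``
value printed by ``shellflow viewlog 1``.
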
3rd step: add more commands
---------------------------

To add a command that depends on an earlier one, append it as a new line at
the end of the script. Shellflow works out automatically which commands
depend on which. Unlike a Makefile, shellflow assumes that every command a
line depends on appears earlier in the script.

.. code:: bash

    bwa mem -R "@RG\tID:DRR002191\tSM:DRR002191\tPL:illumina\tLB:DRR002191" -t 6 hs37d5.fa <(bzip2 -dc ((DRR002191_1.fastq.bz2))) <(bzip2 -dc ((DRR002191_2.fastq.bz2))) > [[DRR002191.sam]]
    gatk SortSam -I ((DRR002191.sam)) -O [[DRR002191-sorted.bam]] --SORT_ORDER coordinate
    gatk MarkDuplicates -I ((DRR002191-sorted.bam)) -O [[DRR002191-markdup.bam]] -M [[DRR002191-markdup-metrics.txt]]
    gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I ((DRR002191-markdup.bam)) -O [[DRR002191-bqsr.txt]] -R hs37d5.fa

When you run the script again, shellflow runs only the added commands:

.. code:: bash

    $ shellflow run gettingstarted.sf
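
The dependency rule can be sketched in a few lines of shell and ``awk``: a
command depends on an earlier line when one of its ``((input))`` files was
declared as an ``[[output]]`` there. This is only an illustration of the
idea under that assumption, not shellflow's actual implementation.
``show_dependencies`` is a hypothetical helper name, and the sketch treats
every line as one command, so it ignores flowscript lines and loops.

.. code:: bash

    # Read an annotated workflow on stdin and print, for each line, the
    # numbers of the earlier lines whose outputs it consumes.
    show_dependencies() {
        awk '
        {
            deps = ""
            split("", seen)      # forget the producers seen on earlier lines
            rest = $0
            # Inputs: every ((file)) on this line.
            while (match(rest, /\(\([^()]*\)\)/)) {
                file = substr(rest, RSTART + 2, RLENGTH - 4)
                if ((file in producer) && !(producer[file] in seen)) {
                    seen[producer[file]] = 1
                    deps = deps " " producer[file]
                }
                rest = substr(rest, RSTART + RLENGTH)
            }
            # Outputs: every [[file]] on this line is available to later lines.
            rest = $0
            while (match(rest, /\[\[[^][]*]]/)) {
                file = substr(rest, RSTART + 2, RLENGTH - 4)
                producer[file] = NR
                rest = substr(rest, RSTART + RLENGTH)
            }
            printf "job %d depends on:%s\n", NR, (deps == "" ? " (nothing)" : deps)
        }'
    }

Running ``show_dependencies < gettingstarted.sf`` on the four-line script
above reports that each command depends on the line directly before it,
because every command reads a file that the previous command writes.
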

4th step: use a variable
------------------------

If a line starts with ``#%``, it is parsed as flowscript, the language
embedded in shellflow. A variable defined this way can be referenced in a
command as ``{{SAMPLE_ID}}``.

.. code:: bash

    #% SAMPLE_ID = "DRR002191"
    bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
    gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
    gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
    gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
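
To preview what a templated line looks like once the variable is filled in,
the placeholder can be substituted by hand with ``sed``. This is a rough
sketch of the substitution only, not of flowscript evaluation.
``expand_var`` is a hypothetical helper name, and the value is assumed to
contain no ``/``, ``&`` or backslash, which ``sed`` would treat specially.

.. code:: bash

    # Replace every {{NAME}} in the text on stdin with VALUE.
    # Usage: expand_var NAME VALUE
    expand_var() {
        sed "s/{{$1}}/$2/g"
    }

    printf '%s\n' 'gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate' |
        expand_var SAMPLE_ID DRR002191

The output is the ``gatk SortSam`` line with ``DRR002191`` in place of each
``{{SAMPLE_ID}}``, with the input and output markers left untouched.
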

5th step: use a loop
--------------------

A ``for`` loop runs the same commands for several samples:

.. code:: bash

    for SAMPLE_ID in DRR002191 DRR002192; do
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done

The list of samples can also be held in a flowscript variable:

.. code:: bash

    #% SAMPLES = ["DRR002191", "DRR002192"]
    for SAMPLE_ID in {{SAMPLES}}; do
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done
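
Each sample in the loop needs both of its paired FASTQ files. As a
pre-flight check before running the workflow, a small shell function can
report which ones are absent. This is a sketch outside shellflow:
``missing_fastq`` is a hypothetical helper name, and it assumes the
``<sample>_1.fastq.bz2`` and ``<sample>_2.fastq.bz2`` naming used in this
tutorial.

.. code:: bash

    # Print the paired FASTQ files that do not exist for the given sample IDs.
    missing_fastq() {
        for sample in "$@"; do
            for mate in 1 2; do
                file="${sample}_${mate}.fastq.bz2"
                [ -e "$file" ] || printf '%s\n' "$file"
            done
        done
    }

    # Prints nothing when all four files are present in the current directory.
    missing_fastq DRR002191 DRR002192
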
6th step: map all FASTQ in a directory
--------------------------------------

.. code:: bash

    for FILENAME in *_1.fastq.bz2; do
        #% SAMPLE_ID = basename(FILENAME, "_1.fastq.bz2")
        bwa mem -R "@RG\tID:{{SAMPLE_ID}}\tSM:{{SAMPLE_ID}}\tPL:illumina\tLB:{{SAMPLE_ID}}" -t 6 hs37d5.fa <(bzip2 -dc (({{SAMPLE_ID}}_1.fastq.bz2))) <(bzip2 -dc (({{SAMPLE_ID}}_2.fastq.bz2))) > [[{{SAMPLE_ID}}.sam]]
        gatk SortSam -I (({{SAMPLE_ID}}.sam)) -O [[{{SAMPLE_ID}}-sorted.bam]] --SORT_ORDER coordinate
        gatk MarkDuplicates -I (({{SAMPLE_ID}}-sorted.bam)) -O [[{{SAMPLE_ID}}-markdup.bam]] -M [[{{SAMPLE_ID}}-markdup-metrics.txt]]
        gatk BaseRecalibrator --known-sites ((common_all_20180423.vcf.gz)) -I (({{SAMPLE_ID}}-markdup.bam)) -O [[{{SAMPLE_ID}}-bqsr.txt]] -R hs37d5.fa
    done
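
To see which sample IDs such a loop would produce, the same suffix stripping
can be tried in plain shell with the standard ``basename`` utility. The
sketch assumes that the flowscript ``basename`` function behaves like the
shell utility of the same name, which the call above suggests but this
tutorial does not state. ``sample_ids`` is a hypothetical helper name.

.. code:: bash

    # List the sample IDs implied by the *_1.fastq.bz2 files in a directory.
    sample_ids() {
        for filename in "$1"/*_1.fastq.bz2; do
            [ -e "$filename" ] || continue     # the glob matched nothing
            basename "$filename" _1.fastq.bz2
        done
    }

    sample_ids .

With ``DRR002191_1.fastq.bz2`` and ``DRR002192_1.fastq.bz2`` in the current
directory, this prints ``DRR002191`` and ``DRR002192``, one per line.
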