=========
Pipelines
=========

.. _create_new_pipeline:

Create a new pipeline
=====================
To create a new pipeline using ``Kronos``, you simply need to follow two steps:

1. Generate a configuration file using the :ref:`make_config <make_config>` command.
2. Configure the pipeline by customizing the resulting :ref:`configuration file <config_file>`, i.e. pass proper values to the attributes in the :ref:`run subsection <run_sec>` and set the :ref:`connections <connections>`.

.. #. initialize the new pipeline using :ref:`init <init>` command

.. note::

    If you need a component that does not already exist, then you need to :ref:`make the component <develop_component>` first.

Examples
^^^^^^^^
Please refer to our GitHub `repositories <https://github.com/MO-BCCRC?tab=repositories>`_ for more examples. The repositories with the suffix ``_workflow`` are the pipelines and the rest are the components.


.. _launch_a_pipeline:

Launch a pipeline
=================
Essentially, you have two options to launch the pipelines generated by ``Kronos``:

1. (Recommended) Use the ``run`` command to initialize and run in one step.
2. Use the ``init`` command to initialize the pipeline first, then run the resulting Python script.

.. note::  
    Make sure the version of the ``Kronos`` package installed on your machine is compatible with the version used to generate the configuration file. That version is shown at the top of the configuration file, in the ``__PIPELINE_INFO__`` section:

    .. figure:: kronos_version.png
        :width: 500px
        :align: center
        :height: 200px
        :alt: Kronos version shown in the __PIPELINE_INFO__ section of a configuration file
        :figclass: align-center


.. _how_to_run_pipeline:

1. Run the pipeline using ``run`` command
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To run a pipeline with the ``run`` command:

.. code-block:: bash
    
    kronos run -k </path/to/my_pipeline_script.py> -c </path/to/components_dir> [options]
   
.. _options:

Input options of ``run`` command
********************************
This is the list of all the input options you can use with ``run`` command:

.. csv-table:: 
    :header: "Option", "Default", "Description"
    :widths: 20, 40, 60
    
    "**-h** or **--help**", "False", "print help - *optional*"
    "**-b** or **--job_scheduler**", "drmaa", "job scheduler used to manage jobs on the cluster - *optional*"
    "**-c** or **--components_dir**", "None", "path to :ref:`components_dir <components_dir>` - *required*"
    "**-d** or **--drmaa_library_path**", "lib/lx24-amd64/libdrmaa.so", "path to :ref:`drmaa_library <use_cluster>` - *optional*"
    "**-e** or **--pipeline_name**", "None", "pipeline name - *optional*"
    "**-i** or **--input_samples**", "None", "path to the input :ref:`samples file <samples_file>` - *optional*"
    "**-j** or **--num_jobs**", "1", "maximum number of simultaneous jobs per pipeline - *optional*"
    "**-k** or **--kronos_pipeline**", "None", "path to ``Kronos``-made :ref:`pipeline script <init>` - *optional*"
    "**-n** or **--num_pipelines**", "1", "maximum number of simultaneous running pipelines - *optional*"
    "**-p** or **--python_installation**", "python", "path to Python executable - *optional*"
    "**-q** or **--qsub_options**", "None", "native qsub specifications for the cluster in a single string - *optional*"
    "**-r** or **--run_id**", "None (current timestamp will be used)", "pipeline :ref:`run id <run_id>` - *optional*"
    "**-s** or **--setup_file**", "None", "path to the :ref:`setup file <setup_file>` - *optional*"
    "**-w** or **--working_dir**", "current working directory", "path to the :ref:`working directory <working_dir>` - *optional*"
    "**-y** or **--config_file**", "None", "path to the :ref:`config_file.yaml <config_file>` - *optional*"
    "**--no_prefix**", "False", "switch off the prefix that is added to all the output files by Kronos - *optional*"

.. note::

    The ``-c`` (``--components_dir``) option is required.
  
.. _qsub_options:

On ``--qsub_options`` option
****************************
The following keywords can be used in the value of the ``--qsub_options`` option.
When the job for a task is submitted, each keyword is replaced with the corresponding value from the :ref:`run subsection <run_sec>` of that task:

- ``mem``: replaced with ``memory`` from the run subsection.
- ``h_vmem``: replaced with 1.2 * ``memory``.
- ``num_cpus``: replaced with ``num_cpus`` from the run subsection.

For example: 

.. code-block:: bash
        
    --qsub_options " -pe ncpus {num_cpus} -l mem_free={mem} -l mem_token={mem} -l h_vmem={h_vmem} [other options]" 

.. note::

    If you specify the ``--qsub_options`` option with hard-coded values (*i.e.* not using these keywords), they will override the values in the run subsection.
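
The keyword substitution described above can be sketched in Python as follows. This is only an illustration, not the actual ``Kronos`` implementation: the function name ``fill_qsub_options`` is hypothetical, and the sketch assumes that ``memory`` is written as a number followed by a unit letter (e.g. ``10G``).

.. code-block:: python

    def fill_qsub_options(template, memory, num_cpus):
        """Replace {mem}, {h_vmem} and {num_cpus} in a qsub options string.

        Assumption: memory looks like '10G', i.e. a number plus a unit suffix.
        """
        # Separate the numeric part from the unit suffix, e.g. '10G' -> '10', 'G'.
        number = memory.rstrip("GMKgmk")
        unit = memory[len(number):]
        # h_vmem is 1.2 times the requested memory, in the same unit.
        h_vmem = "{:g}{}".format(float(number) * 1.2, unit)
        return template.format(mem=memory, h_vmem=h_vmem, num_cpus=num_cpus)


    template = "-pe ncpus {num_cpus} -l mem_free={mem} -l h_vmem={h_vmem}"
    print(fill_qsub_options(template, memory="10G", num_cpus=4))

How ``Kronos`` itself formats or rounds the ``h_vmem`` value is not covered here; the sketch only shows the idea of filling the keywords per task.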

.. _init_using_run:

Initialize using ``run`` command
********************************
If you only have the configuration file and not the pipeline script, you can still use the ``run`` command.
To do so, pass the configuration file using the ``-y`` option.
This instructs ``Kronos`` to initialize the pipeline first and then run the resulting pipeline script.
In this case, you do not have to specify the ``-k`` option.

.. topic:: Tip

   When you use ``-y``, you can also use ``-i`` and ``-s`` to pass a :ref:`samples file <samples_file>` and a :ref:`setup file <setup_file>`, respectively.

.. warning::

   If you specify both ``-y`` and ``-k`` with the ``run`` command, ``Kronos`` uses ``-y`` and ignores ``-k``.

.. note::

   When using ``run`` command, you cannot initialize only (i.e. without running the pipeline).
   Use ``init`` command if you only want to make a pipeline script.
   
.. _cloud:

Run the tasks locally, on a cluster or in the cloud
***************************************************
When launching a pipeline, each task in the pipeline can individually be run locally or on a cluster.
For this you need to use the :ref:`use_cluster` attribute for each task in the configuration file.

You can also launch the pipeline in the cloud. 
Please refer to the :ref:`deploy_kronos_to_the_cloud` for more information.
   
.. _how_to_init_pipeline:

2. Run the pipeline using ``init`` command and the resulting pipeline script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can launch a pipeline by using ``init`` command to create a pipeline script first:

.. code-block:: bash

    kronos -w </path/to/working_dir> init -y </path/to/config_file.yaml> -e <name_for_pipeline>
    
and then by :ref:`running the script <how_to_run_python_script>`.

The ``init`` command has the following input options:
  
.. csv-table:: 
    :header: "Option", "Default", "Description"
    :widths: 20, 40, 60
    
    "**-h** or **--help**", "False", "print help - *optional*"
    "**-e** or **--pipeline_name**", "None", "pipeline name - *required*"
    "**-i** or **--input_samples**", "None", "path to the input :ref:`samples file <samples_file>` - *optional*"
    "**-s** or **--setup_file**", "None", "path to the :ref:`setup file <setup_file>` - *optional*"
    "**-y** or **--config_file**", "None", "path to the :ref:`config_file.yaml <config_file>` - *required*"

.. _samples_file:

Samples file
************
The samples file is a tab-delimited file that lists the content of the :ref:`SAMPLES <samples_sec>` section of the configuration file.
You can use the input option ``-i`` to pass this file when using ``init`` or ``run`` commands.

The content of the file should look like the following:

.. code:: bash

    #sample_id	<key1>	<key2>	...
    <id1>	<value1>	<value2>	...
    <id2>	<value3>	<value4>	...

where:

- the header always starts with ``#sample_id``, and the rest of it, the ``<key>`` fields, are the keys used in ``key:value`` pairs.
- the ``<id>`` fields should be unique IDs, e.g. DAH498, Rx23D, etc.
- the ``<value>`` fields are the corresponding values of the keys in the header.

For instance, the following is the content of an actual samples file:

.. code:: bash

    #sample_id	bam	output
    DG123	/genesis/extscratch/data/DG123.bam	DG123_analysis.vcf
    DG456	/genesis/extscratch/data/DG456.bam	DG456_analysis.vcf

If this file is passed to the ``-i`` option, the resulting configuration file would have a SAMPLES section looking like this:

.. code:: yaml

    __SAMPLES__:
        DG123:
            output: 'DG123_analysis.vcf'
            bam: '/genesis/extscratch/data/DG123.bam'
        DG456:
            output: 'DG456_analysis.vcf'
            bam: '/genesis/extscratch/data/DG456.bam'

.. topic:: Info

    ``Kronos`` uses the samples file to *update* (not to overwrite) the SAMPLES section. This means that if an ID in the samples file already exists in the SAMPLES section of the configuration file, the value of that ID is updated.
    Otherwise, the new sample ID entry is added to the section and the rest of the section remains unchanged.
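
The mapping from a samples file to the SAMPLES section can be sketched as below. This is a minimal illustration and not the code used by ``Kronos``: the function names ``read_samples`` and ``update_samples`` are hypothetical, and the sketch assumes that an existing sample ID is replaced as a whole by the new entry.

.. code-block:: python

    import csv
    import io


    def read_samples(handle):
        """Parse a tab-delimited samples file into {sample_id: {key: value}}."""
        rows = csv.reader(handle, delimiter="\t")
        header = next(rows)
        if header[0] != "#sample_id":
            raise ValueError("the header must start with #sample_id")
        keys = header[1:]
        samples = {}
        for row in rows:
            if not row:
                continue  # skip blank lines
            samples[row[0]] = dict(zip(keys, row[1:]))
        return samples


    def update_samples(section, new_samples):
        """Update the SAMPLES section; IDs not in new_samples stay untouched."""
        for sample_id, values in new_samples.items():
            section[sample_id] = values  # assumption: replace the whole entry
        return section


    text = (
        "#sample_id\tbam\toutput\n"
        "DG123\t/genesis/extscratch/data/DG123.bam\tDG123_analysis.vcf\n"
    )
    section = {"DG456": {"bam": "/genesis/extscratch/data/DG456.bam"}}
    update_samples(section, read_samples(io.StringIO(text)))
    print(sorted(section))

After the update, the section holds both ``DG123`` (added) and ``DG456`` (unchanged).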

.. _setup_file:

Setup file
***********
The setup file is a tab-delimited file that lists the ``key:value`` pairs that should go in the :ref:`GENERAL <general_sec>` or :ref:`SHARED <shared_sec>` sections of the configuration file.
You can use the input option ``-s`` to pass this file when using ``init`` or ``run`` commands.

The content of the file should look like the following:

.. code:: bash

    #section    key    value
    <section_name>    <key1>    <value1>
    <section_name>    <key2>    <value2>

where:

- the header should always be: ``#section    key    value`` (tab-delimited).
- ``<section_name>`` can be either ``__GENERAL__`` or ``__SHARED__``.

For instance, the following is the content of an actual setup file:

.. code:: bash

    #section	key	value
    __GENERAL__	python	/genesis/extscratch/pipelines/apps/anaconda/bin/python
    __GENERAL__	java	/genesis/extscratch/pipelines/apps/jdk1.7.0_06/bin/java 
    __SHARED__	reference	/genesis/extscratch/pipelines/reference/GRCh37-lite.fa
    __SHARED__	ld_library_path	['/genesis/extscratch/pipelines/apps/anaconda/lib','/genesis/extscratch/pipelines/apps/anaconda/lib/lib']

If this file is passed to the ``-s`` option, the resulting configuration file would have GENERAL and SHARED sections looking like this:

.. code:: yaml

	__GENERAL__:
	    python: '/genesis/extscratch/pipelines/apps/anaconda/bin/python'
	    java: '/genesis/extscratch/pipelines/apps/jdk1.7.0_06/bin/java'
	__SHARED__:
	    ld_library_path: "['/genesis/extscratch/pipelines/apps/anaconda/lib','/genesis/extscratch/pipelines/apps/anaconda/lib/lib']"
	    reference: '/genesis/extscratch/pipelines/reference/GRCh37-lite.fa'

.. topic:: Info

    ``Kronos`` uses the setup file to *update* (not to overwrite) GENERAL and SHARED sections which means that if a key in the setup file already exists in the target section, the value of that key is updated.
    Otherwise, the ``key:value`` pair is added to the target section and the rest of the pairs in the target section remain unchanged.
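
The update behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the ``Kronos`` implementation: ``apply_setup`` is a hypothetical helper, the configuration is represented as a plain nested dictionary, and surrounding whitespace in values is stripped.

.. code-block:: python

    def apply_setup(config, lines):
        """Update the __GENERAL__ and __SHARED__ sections from setup-file lines."""
        allowed = ("__GENERAL__", "__SHARED__")
        for line in lines:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and the '#section key value' header
            section, key, value = line.split("\t", 2)
            if section not in allowed:
                raise ValueError("unknown section: " + section)
            # Existing keys are updated, new keys are added, others are kept.
            config.setdefault(section, {})[key] = value.strip()
        return config


    config = {"__GENERAL__": {"python": "python", "java": "java"}}
    setup = [
        "#section\tkey\tvalue",
        "__GENERAL__\tpython\t/genesis/extscratch/pipelines/apps/anaconda/bin/python",
        "__SHARED__\treference\t/genesis/extscratch/pipelines/reference/GRCh37-lite.fa",
    ]
    apply_setup(config, setup)
    print(config["__GENERAL__"]["python"])

Here ``python`` is updated, ``java`` keeps its previous value, and ``reference`` is added to a newly created ``__SHARED__`` section.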

.. _how_to_run_python_script:

Run the pipeline script generated by ``init`` command
*****************************************************
Any pipeline script generated by the ``kronos init`` command can be run as follows:

.. code-block:: bash

    python <my_pipeline.py> -c </path/to/components_dir> [options]

where ``my_pipeline.py`` is the pipeline script you want to run.

.. warning:: 

	You must pass the path of the ``components_dir`` to the input option ``-c`` when running the pipeline.
	See `What is the components directory?`_ for more information on ``components_dir``.

..	You should also export the path to the ``PYTHONPATH`` environment variable as following:
    
    .. code-block:: bash

        export PYTHONPATH=$PYTHONPATH:</path/to/components_dir>

This is the list of all the input options you can use:

.. csv-table:: 
    :header: "Option", "Default", "Description"
    :widths: 20, 40, 60
    
    "**-h** or **--help**", "False", "print help - *optional*"
    "**-b** or **--job_scheduler**", "drmaa", "job scheduler used to manage jobs on the cluster - *optional*"
    "**-c** or **--components_dir**", "None", "path to :ref:`components_dir <components_dir>` - *required*"
    "**-d** or **--drmaa_library_path**", "lib/lx24-amd64/libdrmaa.so", "path to :ref:`drmaa_library <use_cluster>` - *optional*"
    "**-e** or **--pipeline_name**", "None", "pipeline name - *optional*"
    "**-j** or **--num_jobs**", "1", "maximum number of simultaneous jobs per pipeline - *optional*"
    "**-l** or **--log_file**", "None", "name of the log file - *optional*"
    "**-n** or **--num_pipelines**", "1", "maximum number of simultaneous running pipelines - *optional*"
    "**-p** or **--python_installation**", "python", "path to Python executable - *optional*"
    "**-q** or **--qsub_options**", "None", "native qsub specifications for the cluster in a single string - *optional*"
    "**-r** or **--run_id**", "None (current timestamp will be used)", "pipeline :ref:`run id <run_id>` - *optional*"
    "**-w** or **--working_dir**", "current working directory", "path to the :ref:`working directory <working_dir>` - *optional*"
    "**--no_prefix**", "False", "switch off the prefix that is added to all the output files by Kronos - *optional*"

..    "**--draw_vertically**", "specify whether to draw the workflow plot vertically - *optional* "
..    "**--extension**", "specify the desired extension of the resulting workflow plot, e.g. pdf, jpeg, png - *optional* "
..    "**--no_key_legend**", "if True, hide the legend in the workflow plot - *optional* "
..    "**--print_only**", "if True, print the workflow plot. It only generates the workflow plot without running the pipeline - *optional* "
..    "**-s** or **--sample_id**", "sample ID - *optional* "
..    "**-v** or **--verbose**", "verbosity - *optional*"

.. _components_dir:

What is the components directory?
*********************************
It is the directory where you have cloned/stored all the components. 
The generated pipeline has the input option ``-c`` or ``--components_dir`` that requires the path to that directory. 

.. note::
    ``components_dir`` is always the parent directory that contains the component(s). For example, if you have a component called ``comp1`` in the path ``~/my_components/comp1``, you should pass ``~/my_components`` to the ``-c`` option.


Results generated by a pipeline 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When a pipeline is run, a directory named after the :ref:`run ID <run_id>` is made inside the :ref:`working directory <working_dir>`.
All the output files and directories are stored here, i.e. in ``<working_dir>/<run_ID>/``.

.. _working_dir:

What is the working directory?
******************************
It is a directory used by ``Kronos`` to store all the resulting files.
You can specify the path to the desired working directory via the :ref:`input option <kronos_commands>` ``-w``.

.. topic:: Tip
    
    If the directory does not exist, then it will be made.

.. topic:: Tip

    If you do not specify the working directory, the current directory is used instead.

.. _run_id:

What is the run ID?
*******************
Each time a pipeline is run, a unique ID is generated for that run unless the user specifies one with the ``-r`` option.
This ID is used for the following purposes:

- to trace back the run, i.e. its logs, results, etc.
- to enable re-running an incomplete run, which automatically picks up from where it left off
- to avoid overwriting the results if the same working directory is used for all the runs

.. topic:: Info

    The ID generated by ``Kronos`` (if ``-r`` is not specified) is a timestamp of the form 'year-month-day_hour-minute-second'.
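
A timestamp of this form can be produced in Python as sketched below. The helper name ``make_run_id`` is hypothetical, and the zero-padded, four-digit-year format is an assumption; the exact string produced by ``Kronos`` may differ in detail.

.. code-block:: python

    from datetime import datetime


    def make_run_id(now=None):
        """Return a run ID shaped like year-month-day_hour-minute-second."""
        now = now or datetime.now()
        # Assumption: zero-padded fields, e.g. 2015-03-09_14-05-07.
        return now.strftime("%Y-%m-%d_%H-%M-%S")


    print(make_run_id())

Because such an ID changes every second, consecutive runs in the same working directory end up in different ``<working_dir>/<run_ID>/`` directories.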

.. _results_dir:

What is the structure of the results directory generated by a pipeline?
***********************************************************************
The following tree shows the general structure of the ``<working_dir>/<run_ID>/`` directory where the results are stored: 

.. code-block:: bash

    <working_dir>
    |-- <run_id>
    |   |-- <sample_id1>_<pipeline_name>
    |   |   |-- logs
    |   |   |-- outputs
    |   |   |-- scripts
    |   |   |-- sentinels
    |   |-- <sample_id2>_<pipeline_name>
    |   |   |-- logs
    |   |   |-- outputs
    |   |   |-- scripts
    |   |   |-- sentinels
    |   |-- <pipeline_name>_<run_id>.yaml
    |   |-- <pipeline_name>_<run_id>.log

where:

- an individual subdirectory is made with name ``<sample_id>_<pipeline_name>`` for each sample in the :ref:`SAMPLES section <samples_sec>`.
- there are always the following four subdirectories in the ``<sample_id>_<pipeline_name>`` directory:

  - :file:`logs`: where all the log files are stored
  - :file:`outputs`: where all the resulting files are stored
  - :file:`scripts`: where all the scripts used to run the components are stored
  - :file:`sentinels`: where all the sentinel files are stored
 
If there are no samples in the SAMPLES section, then a subdirectory with name ``__shared__only___<pipeline_name>`` is made instead of ``<sample_id>_<pipeline_name>``.
In fact, since there are no IDs in the SAMPLES section, ``Kronos`` uses the string ``__shared__only__`` to indicate that the SAMPLES section is empty.
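
The layout above can be expressed as a short Python sketch that lists the expected subdirectories for a run. The function ``result_dirs`` is a hypothetical helper written for illustration; it only builds paths and does not reflect how ``Kronos`` creates them.

.. code-block:: python

    import os


    def result_dirs(working_dir, run_id, pipeline_name, sample_ids):
        """List the per-sample subdirectories expected under <working_dir>/<run_id>."""
        # With an empty SAMPLES section, '__shared__only__' stands in for the ID.
        ids = list(sample_ids) or ["__shared__only__"]
        run_dir = os.path.join(working_dir, run_id)
        subdirs = ("logs", "outputs", "scripts", "sentinels")
        return [
            os.path.join(run_dir, "{}_{}".format(sample_id, pipeline_name), sub)
            for sample_id in ids
            for sub in subdirs
        ]


    for path in result_dirs("work", "my_run", "my_pipeline", ["DG123", "DG456"]):
        print(path)

Passing an empty list of sample IDs yields the four subdirectories of ``__shared__only___my_pipeline`` instead.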

.. note::

    The developer of the pipeline can customize the content of the :file:`outputs` directory (see :ref:`output_dir_customization` for more information). 
    So, you might see more directories inside that directory.

.. topic:: Info

    The ``scripts`` directory is used by ``Kronos`` to store and manage the scripts and should not be modified.
    
.. topic:: Info
    
    Sentinel files mark the successful completion of a task in the pipeline. 
    The ``sentinels`` directory is simply used for storing these files.

.. _relaunch:

How can I relaunch a pipeline?
******************************
If a pipeline you have run stopped at some point for any reason, e.g. a breakpoint or an error, you can re-run it from where it left off.
To do so, use the exact same command you used in the first place and also pass the :ref:`run ID <run_id>` of the first run to the input option ``-r``.

.. note::

    If you forget to pass the run ID or pass a nonexistent run ID by mistake, ``Kronos`` considers that as a new run and launches the pipeline from scratch.
    This will not overwrite your previous results.

.. topic:: Tip

   If you want to relaunch a pipeline from an arbitrary task (that already has a sentinel file), you need to go to the :ref:`sentinels directory <results_dir>` and delete the sentinel file corresponding to that task. 
   Then relaunch the pipeline as mentioned above.
   Remember that all downstream tasks connected to this task will also be re-run, regardless of whether they have a sentinel file.
   The reason is that ``Kronos`` checks the timestamps of the sentinels, and if the sentinel of a downstream task is outdated compared to that of the current task, it re-runs the downstream task too.
    
.. topic:: Tip

    If you want to run a part of a pipeline between two tasks (two breakpoints) several times, each time you need to delete the sentinel files of the tasks between the two breakpoints as well as the sentinel file of the second breakpoint.
    In the new version, we're working on making this easier by eliminating the need to delete these sentinels each time.

.. topic:: Tip

    A sentinel file name looks like ``TASK_i__sentinel_file``.
    For the breakpoints, the sentinel file name looks like ``__BREAK_POINT_TASK_i__sentinel_file``.
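
Deleting the sentinel file of a task before a relaunch can be sketched in Python as below. The helpers ``sentinel_name`` and ``remove_sentinel`` are hypothetical, and the sketch assumes the file name is exactly the task name followed by ``__sentinel_file`` (with the ``__BREAK_POINT_`` prefix for breakpoints), as the patterns above suggest.

.. code-block:: python

    import os

    SENTINEL_SUFFIX = "__sentinel_file"
    BREAK_POINT_PREFIX = "__BREAK_POINT_"


    def sentinel_name(task, is_breakpoint=False):
        """Build the assumed sentinel file name for a task or a breakpoint."""
        prefix = BREAK_POINT_PREFIX if is_breakpoint else ""
        return prefix + task + SENTINEL_SUFFIX


    def remove_sentinel(sentinels_dir, task, is_breakpoint=False):
        """Delete the sentinel file of a task; return True if a file was removed."""
        path = os.path.join(sentinels_dir, sentinel_name(task, is_breakpoint))
        if os.path.exists(path):
            os.remove(path)
            return True
        return False


    print(sentinel_name("TASK_1"))
    print(sentinel_name("TASK_1", is_breakpoint=True))

After removing the sentinel, relaunch the pipeline with the same command and the same run ID so that the task and its downstream tasks are run again.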