Components

This part explains how to make new components. You need to know the following definitions first:

seed: a seed is a computer application, a program or in general a command line tool that performs a specific task. This can be a simple bash copy command or a software suite like MutationSeq or Strelka.

component: a component is a wrapper around a seed that makes the seed compatible with Kronos so that it can be used as part of a pipeline. In other words, components are the building blocks of the pipelines generated by Kronos.

Develop a component

The purpose of components is to modularize workflows with reusable building blocks that require minimal development. Making a new component takes very few lines of code. The simple development instructions eliminate, for example, the need for Ruffus decorators, regex-based input/output management, and complicated dependency management in the code, all of which can easily become very complex as the number of tasks in a workflow grows. Furthermore, a large workflow can be divided into a set of small components, which makes workflow development much faster and more manageable.

All command line tools can be used as seeds and therefore wrapped as Kronos components. Regardless of how complicated they are, their corresponding components have a standard directory structure composed of specific wrappers and sub-directories. The wrappers are agnostic to the programming language used for developing the seed. Components should be developed before the workflow is made. However, because components are developed independently and are reusable, each component is developed only once and can then be used in various pipelines.

In order to develop a component you need to:

  1. create a directory with a name that is the same as the name you want for the component, e.g. my_comp.

  2. create a directory called “component_seed” inside the my_comp directory and copy the seed source code into it.

  3. create the following files inside the my_comp directory: __init__.py, component_main.py, component_params.py, component_reqs.py and component_ui.py.

Tip

All the above files and directories are generated by the Kronos make_component command. The user only needs to customize them.

Tip

The seed source code does not have to be copied into the component_seed directory. Instead, the seed can be treated as a requirement of the component and listed in component_reqs.py.

The component directory tree looks like the following:

|-- <component_name>
|   |-- __init__.py
|   |-- component_main.py
|   |-- component_params.py
|   |-- component_reqs.py
|   |-- component_ui.py
|   |-- component_seed

Note

<component_name> should be replaced with the actual name of the component, e.g. my_comp. The rest of the file and directory names should be exactly as shown above.
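
For illustration only, the layout above can be reproduced by hand with a few lines of Python. This sketch is not how make_component works internally; the helper names make_skeleton and list_skeleton and the use of a temporary directory are assumptions made for the example.

```python
import tempfile
from pathlib import Path

# File and directory names taken from the tree above.
COMPONENT_FILES = [
    "__init__.py",
    "component_main.py",
    "component_params.py",
    "component_reqs.py",
    "component_ui.py",
]
SEED_DIR = "component_seed"


def make_skeleton(parent_dir, component_name):
    """Create an empty component layout and return its path (hypothetical helper)."""
    root = Path(parent_dir) / component_name
    (root / SEED_DIR).mkdir(parents=True, exist_ok=True)
    for name in COMPONENT_FILES:
        (root / name).touch()
    return root


def list_skeleton(root):
    """Return the sorted names of the entries directly inside the component."""
    return sorted(p.name for p in Path(root).iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(list_skeleton(make_skeleton(tmp, "my_comp")))
```

The files created this way are empty; in practice the generated templates shown in the following sections are the better starting point.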

Tip

It is recommended to add the following files and directories (not generated automatically) as well:

  • component_test: a directory where all the test files exist.
  • README: a readme file to provide more information about the component.

Component_main

The core of a component is the component_main.py Python script. This module defines the Component class, which extends the ComponentAbstract class.

Using the make_component command, the following component_main.py file is generated:

"""
component_main.py
This module contains Component class which extends
the ComponentAbstract class. It is the core of a component.

Note the places you need to change to make it work for you.
They are marked with keyword 'TODO'.
"""

from kronos.utils import ComponentAbstract
import os


class Component(ComponentAbstract):

    """
    TODO: add component doc here.
    """

    def __init__(self, component_name="my_comp",
                 component_parent_dir=None, seed_dir=None):

        ## TODO: pass the version of the component here.
        self.version = "v0.99.0"

        ## initialize ComponentAbstract
        super(Component, self).__init__(component_name,
                                        component_parent_dir, seed_dir)

    ## TODO: write the focus method if the component is parallelizable.
    ## Note that it should return cmd, cmd_args.
    def focus(self, cmd, cmd_args, chunk):
        pass
    #    return cmd, cmd_args

    ## TODO: this method should make the command and command arguments
    ## used to run the component_seed via the command line. Note that
    ## it should return cmd, cmd_args.
    def make_cmd(self, chunk=None):
        ## TODO: replace 'comp_req' with the actual component
        ## requirement, e.g. 'python', 'java', etc.
        cmd = self.requirements['comp_req']

        cmd_args = []

        args = vars(self.args)

        ## TODO: fill the following component params to seed params dictionary
        ## if the name of parameters of the seed are different than
        ## component parameter names.
        comp_seed_map = {
                         #e.g. 'component_param1': 'seedParam1',
                         #e.g. 'component_param2': 'seedParam2',
                        }

        for k, v in args.items():
            if v is None or v is False:
                continue

            ## TODO: uncomment the next line if you are using
            ## comp_seed_map dictionary.
            # k = comp_seed_map[k]

            cmd_args.append('--' + k)

            if isinstance(v, bool):
                continue
            if isinstance(v, str):
                v = repr(v)
            if isinstance(v, (list, tuple)):
                cmd_args.extend(v)
            else:
                cmd_args.extend([v])

        if chunk is not None:
            cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

        return cmd, cmd_args

## To run as stand alone
def _main():
    c = Component()
    c.args = component_ui.args
    c.run()

if __name__ == '__main__':
    import component_ui
    _main()

Note

The places you need to change in the generated file to make it work for you are marked with the keyword ‘TODO’.

There are two methods in this file that you need to customize:

focus method

Each parallelizable component requires a focus method. The purpose of this method is to tell the component to process only one chunk of the input data rather than the entire file. How this is done varies from component to component, but it generally adds to or alters the component command to this end. For example, in the following implementation, the focus method simply passes the chunk to the --interval option in the command arguments cmd_args (most of the time, this implementation does the job):

def focus(self, cmd, cmd_args, chunk):
    ## restrict the seed to the given chunk via its --interval option
    cmd_args.append('--interval ' + chunk)
    return cmd, cmd_args

Note

You need to implement the focus method only if the component is parallelizable.
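
As a minimal sketch of what focus does to the command, the standalone function below mirrors the implementation above outside of any class. It is a variant rather than the generated code: it builds a new list instead of modifying cmd_args in place. The chunk value '2' and the example command are placeholders; what a chunk actually contains depends on how the pipeline is set up, which this section does not specify.

```python
def focus(cmd, cmd_args, chunk):
    """Return the command restricted to one chunk via the seed's --interval option."""
    # Build a new list so the caller's argument list is left untouched.
    focused_args = list(cmd_args) + ['--interval ' + str(chunk)]
    return cmd, focused_args


if __name__ == "__main__":
    base_args = ['--foo', 'data1']
    cmd, args = focus('python my_seed_command.py', base_args, '2')
    # Joined with spaces here only to display the result.
    print(cmd, ' '.join(args))
```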

make_cmd method

All components should implement this method in their component_main.py. It returns the command and the command arguments that one would use to run the seed on the command line. For example, if the seed can be run using the following command:

python my_seed_command.py --foo data1 --bar data2

then the make_cmd method would look like this (note that only the first two lines of the default make_cmd method generated by Kronos need to change):

def make_cmd(self, chunk=None):
    path = os.path.join(self.seed_dir, 'my_seed_command.py')
    cmd = self.requirements['python'] + ' ' + path

    cmd_args = []

    args = vars(self.args)

    ## TODO: fill the following component params to seed params dictionary
    ## if the name of parameters of the seed are different than
    ## component parameter names.
    comp_seed_map = {
                     #e.g. 'component_param1': 'seedParam1',
                     #e.g. 'component_param2': 'seedParam2',
                    }

    for k, v in args.items():
        if v is None or v is False:
            continue

        ## TODO: uncomment the next line if you are using
        ## comp_seed_map dictionary.
        # k = comp_seed_map[k]

        cmd_args.append('--' + k)

        if isinstance(v, bool):
            continue
        if isinstance(v, str):
            v = repr(v)
        if isinstance(v, (list, tuple)):
            cmd_args.extend(v)
        else:
            cmd_args.extend([v])

    if chunk is not None:
        cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

    return cmd, cmd_args

Tip

In the above example, python is a requirement of the component and should be added to the component's component_reqs.py. Also, the parameters foo and bar should be added to component_params.py.
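
The argument loop in make_cmd can be tried out on its own. The sketch below copies that loop into a free function, build_cmd_args (a hypothetical name), and feeds it an argparse.Namespace like the one component_ui produces. Unlike the generated code, it falls back to the component parameter name when a key is missing from comp_seed_map; that fallback is a choice made for this example, as are the parameter names and values.

```python
import argparse


def build_cmd_args(args, comp_seed_map=None):
    """Translate parsed component arguments into a flat list of seed arguments."""
    comp_seed_map = comp_seed_map or {}
    cmd_args = []
    for k, v in vars(args).items():
        # Unset options and switched-off flags are left out entirely.
        if v is None or v is False:
            continue
        # Rename the parameter if the seed uses a different name.
        cmd_args.append('--' + comp_seed_map.get(k, k))
        # A flag that is switched on takes no value.
        if isinstance(v, bool):
            continue
        # Strings are quoted with repr, as in the generated code.
        if isinstance(v, str):
            v = repr(v)
        if isinstance(v, (list, tuple)):
            cmd_args.extend(v)
        else:
            cmd_args.append(v)
    return cmd_args


if __name__ == "__main__":
    parsed = argparse.Namespace(foo='data1', bar=None, deep=True, coverage=4)
    print(build_cmd_args(parsed, {'foo': 'fooFile'}))
```

Note that non-string values such as the integer 4 are passed through unchanged, so whatever consumes the list has to convert them to text before building a shell command.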

ComponentAbstract class

This class comprises the following attributes and methods:

Attributes:

Attribute         Description
----------------  -----------
args              the argparse namespace containing all the input arguments from the Component_ui module
components_dir    path to the directory where the component exists
component_name    name of the component - specific
component_params  Component_params module of the component
component_reqs    Component_reqs module of the component
env_vars          see Component_reqs
memory            see Component_reqs
parallel          see Component_reqs
requirements      see Component_reqs
seed_dir          path to the directory where the seed exists. Most of the time it is <component_name>/<component_seed>
seed_version      version of the seed
version           version of the component - specific

Tip

“specific” means the attribute should be assigned a value when implementing a component.

Methods:

Method    Description
--------  -----------
__init__  initialize general attributes that each component must have
run       run the component command locally
focus     update the command and command arguments for each chunk - virtual
make_cmd  make the command used to run the seed of the component. This returns the same command that one would use to run the component as a stand alone program via command line - virtual
test      run unittest of the component - virtual

Tip

The class can be imported from the utils module of the kronos package:

from kronos.utils import ComponentAbstract
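
Reading “virtual” in its usual object-oriented sense, the methods marked that way are the ones a component is expected to override. The toy classes below only illustrate that pattern. ToyComponentAbstract and its describe method are stand-ins written for this example; they do not reproduce the real ComponentAbstract, its constructor, or its run method.

```python
class ToyComponentAbstract(object):
    """Stand-in base class for illustration; not the kronos implementation."""

    def __init__(self, component_name):
        self.component_name = component_name

    def make_cmd(self, chunk=None):
        # Subclasses must provide the command; the base class cannot know it.
        raise NotImplementedError("make_cmd must be implemented by the component")

    def describe(self):
        # Shared behaviour that relies on the subclass's make_cmd.
        cmd, cmd_args = self.make_cmd()
        return ' '.join([cmd] + [str(a) for a in cmd_args])


class Component(ToyComponentAbstract):
    def make_cmd(self, chunk=None):
        return 'python my_seed_command.py', ['--foo', 'data1']


if __name__ == "__main__":
    print(Component('my_comp').describe())
```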

Component_params

This is a Python module that contains the following information:

  • input_files: a dictionary whose keys are the input file parameters and whose values are the default values or the proper flags based on the component UI. For example:
input_files={'samples':['tumour:__REQUIRED__',
                        'normal:__REQUIRED__',
                        'reference:__REQUIRED__',
                        'model:__REQUIRED__'
                        ],
             'config':'some_default.cfg',
             'positions_file':None
             }

Note

This dictionary includes only parameters that expect input files or directories.

  • output_files: a dictionary whose keys are the output file parameters and whose values are the default values or the proper flags based on the component UI. For example:
output_files = {'export_features':None,
                'log_file':'mutationSeq_run.log',
                'out':None
                }

Note

This dictionary includes only parameters that expect output files or directories.

  • input_params: a dictionary whose keys are the non-file input parameters and whose values are the default values or the proper flags based on the component UI. For example:
input_params = {'all':'__FLAG__',
                'buffer_size':'2G',
                'coverage':4,
                'deep':'__FLAG__',
                'interval':None,
                'no_filter':'__FLAG__'
                }

Note

All other parameters that are not included in input_files and output_files should be listed in input_params.
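
To make the role of the flags concrete, the sketch below sorts the entries of such dictionaries into required parameters, flags, and parameters with plain defaults. The helper name classify_params is hypothetical and not part of Kronos, and the dictionaries are shortened versions of the examples above.

```python
REQUIRED = '__REQUIRED__'
FLAG = '__FLAG__'


def classify_params(*param_dicts):
    """Group parameter names by how their value in component_params.py is marked."""
    groups = {'required': [], 'flags': [], 'defaults': {}}
    for params in param_dicts:
        for name, value in params.items():
            if value == REQUIRED:
                groups['required'].append(name)
            elif value == FLAG:
                groups['flags'].append(name)
            else:
                # Anything else, including None, is treated as a default value.
                groups['defaults'][name] = value
    return groups


input_files = {'tumour': '__REQUIRED__', 'config': 'some_default.cfg'}
output_files = {'out': None, 'log_file': 'mutationSeq_run.log'}
input_params = {'all': '__FLAG__', 'coverage': 4, 'interval': None}

if __name__ == "__main__":
    print(classify_params(input_files, output_files, input_params))
```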

Using the make_component command, the following component_params.py file is generated:

"""
component_params.py

Note the places you need to change to make it work for you.
They are marked with keyword 'TODO'.
"""

## TODO: here goes the list of the input files. Use flags:
## '__REQUIRED__' to make it required
## '__FLAG__' to make it a flag or switch.
input_files  = {
#                 'input_file1' : '__REQUIRED__',
#                 'input_file2' : None
                }

## TODO: here goes the list of the output files.
output_files = {
#                 'output_file1' : '__REQUIRED__',
#                 'output_file1' : None
                }

## TODO: here goes the list of the input parameters excluding input/output files.
input_params = {
#                 'input_param1' : '__REQUIRED__',
#                 'input_param2' : '__FLAG__',
#                 'input_param3' : None
                }

## TODO: here goes the return value of the component_seed.
## DO NOT USE, Not implemented yet!
return_value = []

Example: This is an example showing the content of a component_params.py file:

input_files  = {'tumour':'__REQUIRED__',
                'normal':'__REQUIRED__',
                'reference':'__REQUIRED__',
                'model':'__REQUIRED__',
                'config':'metadata.config',
                'positions_file':None
                }

output_files = {'export_features':None,
                'log_file':'mutationSeq_run.log',
                'out':'__REQUIRED__'
                }

input_params = {'all':'__FLAG__',
                'buffer_size':'2G',
                'coverage':4,
                'deep':'__FLAG__',
                'interval':None,
                'no_filter':'__FLAG__',
                'normalized':'__FLAG__',
                'normal_variant':25,
                'purity':70,
                'mapq_threshold':20,
                'baseq_threshold':10,
                'indl_threshold':0.05,
                'manifest':'__OPTIONAL__',
                'single':'__FLAG__',
                'threshold':0.5,
                'tumour_variant':2,
                'features_only':'__FLAG__',
                'verbose':'__FLAG__',
                'titan_mode':'__FLAG__'
                }

Component_reqs

This is a Python module that contains the following information:

  • env_vars: a dictionary whose keys are the names of environment variables and whose values are the path/content to export. The values can be updated in the configuration file using env_var in the run subsection. It is therefore recommended not to include paths as values in this file and to use an empty list, [], or None as the value instead.
  • memory: specifies the minimum memory required by the component to properly run on a cluster. The format is nG, e.g. 30G.
  • parallel: a boolean flag that specifies whether or not a component can run in parallel mode.
  • requirements: a dictionary whose keys are usually the names of programs/software and whose values are None or the flag __REQUIRED__. The values are later updated by Kronos using the content of the __GENERAL__ section.
  • seed_version: the version of the seed.
  • version: the version of the component.
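
The constraints listed above lend themselves to a quick sanity check. The function below is a sketch of such a check, not something Kronos is known to provide; check_reqs and its messages are hypothetical, and the memory pattern simply encodes the nG format described above.

```python
import re

ALLOWED_REQ_VALUES = (None, '__REQUIRED__')


def check_reqs(memory, parallel, requirements):
    """Return a list of problems found in component_reqs.py style values."""
    problems = []
    # memory is expected in the nG format, e.g. '30G'.
    if not isinstance(memory, str) or not re.fullmatch(r'\d+G', memory):
        problems.append("memory should look like '30G'")
    if not isinstance(parallel, bool):
        problems.append("parallel should be True or False")
    for name, value in requirements.items():
        if value not in ALLOWED_REQ_VALUES:
            problems.append("requirement %r should be None or '__REQUIRED__'" % name)
    return problems


if __name__ == "__main__":
    print(check_reqs('4G', True, {'java': '__REQUIRED__'}))
    print(check_reqs('4 GB', 'yes', {'java': '/usr/bin/java'}))
```

The first call reports no problems; the second reports one problem per argument.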

Using the make_component command, the following component_reqs.py file is generated:

"""
component_reqs.py

Note the places you need to change to make it work for you.
They are marked with keyword 'TODO'.
"""

## TODO: here goes the list of the environment variables, if any,
## required to export for the component to function properly.
env_vars = {
#            'env_var1' : ['value1', 'value2'],
#            'env_var2' : 'value3'
            }

## TODO: here goes the max amount of the memory required.
memory = '5G'

## TODO: set this to True if the component is parallelizable.
parallel = False

## TODO: here goes the list of the required software/apps
## called by the component.
requirements = {
#                'python': '__REQUIRED__',
                }

## TODO: here goes the version of the component seed.
seed_version = '0.99.0'

## TODO: here goes the version of the component itself.
version = '0.99.0'

Example: This is an example showing the content of a component_reqs.py:

env_vars = {'LD_LIBRARY_PATH': []}

memory = '4G'

parallel = True

requirements = {'java': '__REQUIRED__'}

seed_version = 'version 3.2'

version = 'v1.0.1'

Component_ui

This is a Python module that contains an argparse UI for the component. Using the make_component command, the following component_ui.py file is generated:

"""
component_ui.py

Note the places you need to change to make it work for you.
They are marked with keyword 'TODO'.
"""

import argparse

#==============================================================================
# make a UI
#==============================================================================
## TODO: pass the name of the component to the 'prog' parameter and a
## brief description of your component to the 'description' parameter.
parser = argparse.ArgumentParser(prog='my_comp',
                                 description = """
                                 brief description of your component goes here.""")

## TODO: create the list of input options here. Add as many as desired.
parser.add_argument(
                    "-x", "--xparam",
                    default = None,
                    help= """
                    help message goes here.
                    """)

## parse the argument parser.
args, unknown = parser.parse_known_args()

Example: This is an example showing the content of a component_ui.py:

import sys
import argparse

#==============================================================================
# make a UI
#==============================================================================
parser = argparse.ArgumentParser(prog='snpeff',
                                 description='''Genetic variant annotation and effect
                                 prediction toolbox. It annotates and predicts the
                                 effects of variants on genes (such as amino acid changes)''',
                                 epilog='''Input file: Default is STDIN''')

# required arguments
required_arguments = parser.add_argument_group("Required arguments")

required_arguments.add_argument("--out",
                               default=None,
                               required=True,
                               help='''specify the path/to/out.vcf to save output to a file''')

# mandatory / positional arguments
required_arguments.add_argument("genome_version",
                               choices=['GRCh37.66'],
                               help='''genomic build version''')

required_arguments.add_argument("variants_file",
                               help='''file containing variants''')

# optional options
optional_options = parser.add_argument_group("Options")

optional_options.add_argument("-a", "--around",
                              default=False, action="store_true",
                              help='''Show N codons and amino acids around change
                              (only in coding regions). Default is 0 codons.''')

args, unknown = parser.parse_known_args()

Warning

It is required to use parse_known_args instead of parse_args.
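
The difference between the two calls can be seen with a small parser. In the sketch below, the --other option is deliberately left undefined: parse_known_args returns it in a list of leftovers, whereas parse_args exits with an error. This section does not state why Kronos needs this, so the example only demonstrates the argparse behaviour. The argument list is passed explicitly so the example does not depend on the real command line.

```python
import argparse

parser = argparse.ArgumentParser(prog='my_comp')
parser.add_argument("-x", "--xparam", default=None)

argv = ['--xparam', '1', '--other', '2']

# parse_known_args keeps going and hands back what it did not recognize.
args, unknown = parser.parse_known_args(argv)
print(args.xparam, unknown)

# parse_args treats the same input as an error and raises SystemExit.
try:
    parser.parse_args(argv)
except SystemExit as exc:
    print("parse_args exited with status", exc.code)
```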

component_seed

This is a directory within the component directory where all the source code of the actual program resides.

Examples

Please refer to our GitHub repositories for more examples. The repositories with the suffix _workflow are the pipelines and the rest are the components.