.. _components:

==============
Components
==============

This part explains how to make new components. You need to know the following definitions first:

**seed:** a *seed* is a computer application, a program or, in general, a command line tool that performs a specific task. This can be a simple bash copy command or a software suite like `MutationSeq `_ or `Strelka `_.

**component:** a *component* is a wrapper around a seed that makes the seed compatible with ``kronos`` so that the seed can then be used as a part of a pipeline. In other words, components are the building blocks of the pipelines generated by ``kronos``.

.. _develop_component:

Develop a component
===================

The purpose of components is to modularize workflows with reusable building blocks that require minimal development.
Making a new component takes very few lines of code.
The simple development instructions eliminate, for example, the need to use Ruffus decorators, input/output management using regular expressions, and complicated dependency management in the code, which can easily become very complex as the number of tasks in a workflow grows.
Furthermore, a large workflow can be divided into a set of small components, which makes workflow development much faster and more manageable.

All command line tools can be used as seeds and therefore wrapped as ``kronos`` components.
Regardless of how complicated they are, their corresponding components have a standard directory structure composed of specific wrappers and sub-directories.
The wrappers are agnostic to the programming language used for developing the seed.

Components should be developed before the workflow is made.
However, because components are developed individually and independently and are reusable, each component is developed only once and can then be used in various pipelines.

In order to develop a component you need to:

1. Create a directory with the same name as the name you want for the component, e.g. ``my_comp``.
2. Create a directory called ":ref:`component_seed`" inside the ``my_comp`` directory and copy the seed source code into it.
3. Create the following files inside the ``my_comp`` directory:

   - ``__init__.py``: an empty file.
   - :ref:`component_main.py `: the main Python script that contains the ``Component`` class.
   - :ref:`component_params.py `: contains all the information about the input/output parameters of the component.
   - :ref:`component_reqs.py `: contains all the information about the requirements of the component.
   - :ref:`component_ui.py `: an `argparse `_ UI for the component.

.. topic:: Tip

   All the above files and directories are generated by the ``kronos`` :ref:`make_component ` command. The user only needs to customize them.

.. topic:: Tip

   The seed source code is not required to be copied into the ``component_seed`` directory. Instead, the seed can be used as a requirement for the component, in which case it is listed in ``component_reqs.py``.

The component directory tree looks like the following:

.. code-block:: yaml

   |--
   |    |-- __init__.py
   |    |-- component_main.py
   |    |-- component_params.py
   |    |-- component_reqs.py
   |    |-- component_ui.py
   |    |-- component_seed

.. note::

   ```` should be replaced with the actual name of the component, e.g. ``my_comp``. The rest of the file and directory names should be exactly as shown above.

.. topic:: Tip

   It is recommended to add the following files and directories (not generated automatically) as well:

   - ``component_test``: a directory that holds all the test files.
   - ``README``: a readme file that provides more information about the component.

.. _component_main:

Component_main
^^^^^^^^^^^^^^

The core of a component is the ``component_main.py`` Python script.
This module defines the ``Component`` class, which extends the :ref:`ComponentAbstract class `.
Using the :ref:`make_component ` command, the following ``component_main.py`` file is generated:

.. code-block:: python

   """
   component_main.py
   This module contains Component class which extends
   the ComponentAbstract class. It is the core of a component.

   Note the places you need to change to make it work for you.
   They are marked with keyword 'TODO'.
   """

   from kronos.utils import ComponentAbstract
   import os


   class Component(ComponentAbstract):

       """
       TODO: add component doc here.
       """

       def __init__(self, component_name="my_comp",
                    component_parent_dir=None, seed_dir=None):

           ## TODO: pass the version of the component here.
           self.version = "v0.99.0"

           ## initialize ComponentAbstract
           super(Component, self).__init__(component_name,
                                           component_parent_dir, seed_dir)

       ## TODO: write the focus method if the component is parallelizable.
       ## Note that it should return cmd, cmd_args.
       def focus(self, cmd, cmd_args, chunk):
           pass
           # return cmd, cmd_args

       ## TODO: this method should make the command and command arguments
       ## used to run the component_seed via the command line. Note that
       ## it should return cmd, cmd_args.
       def make_cmd(self, chunk=None):
           ## TODO: replace 'comp_req' with the actual component
           ## requirement, e.g. 'python', 'java', etc.
           cmd = self.requirements['comp_req']

           cmd_args = []

           args = vars(self.args)

           ## TODO: fill the following component params to seed params dictionary
           ## if the name of parameters of the seed are different than
           ## component parameter names.
           comp_seed_map = {
                            #e.g. 'component_param1': 'seedParam1',
                            #e.g. 'component_param2': 'seedParam2',
                            }

           for k, v in args.items():
               if v is None or v is False:
                   continue

               ## TODO: uncomment the next line if you are using
               ## comp_seed_map dictionary.
               # k = comp_seed_map[k]

               cmd_args.append('--' + k)

               if isinstance(v, bool):
                   continue
               if isinstance(v, str):
                   v = repr(v)
               if isinstance(v, (list, tuple)):
                   cmd_args.extend(v)
               else:
                   cmd_args.extend([v])

           if chunk is not None:
               cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

           return cmd, cmd_args

   ## To run as stand alone
   def _main():
       c = Component()
       c.args = component_ui.args
       c.run()

   if __name__ == '__main__':
       import component_ui
       _main()

.. note::

   The places you need to change in the generated file to make it work for you are marked with the keyword 'TODO'.

There are two methods in this file that you need to customize:

- :ref:`focus `
- :ref:`make_cmd `

.. _focus_method:

``focus`` method
****************

Each parallelizable component requires a ``focus`` method.
The purpose of this method is to tell the component to process only one :ref:`chunk ` of the input data rather than the entire file.
How this is done varies by component, but in essence the method adds to or alters the component command to this end.
For example, in the following implementation, the ``focus`` method simply passes the chunk to the ``--interval`` option in the command arguments ``cmd_args`` (most of the time, this implementation does the job):

.. code-block:: python

   def focus(self, cmd, cmd_args, chunk):
       # restrict the seed to a single chunk via its --interval option
       cmd_args.append('--interval ' + chunk)
       return cmd, cmd_args

.. note::

   You need to implement the ``focus`` method only if the component is parallelizable.

.. _make_cmd_method:

``make_cmd`` method
*******************

All components should implement this method in their ``component_main.py``.
This method returns the command and the command arguments that one can use to run the seed on a command line.
For example, if the seed can be run using the following command:

.. code-block:: bash

   python my_seed_command.py --foo data1 --bar data2

then the ``make_cmd`` method would look like this (note that we only need to change the first two lines of the default method generated by ``kronos``):

.. code-block:: python

   def make_cmd(self, chunk=None):
       path = os.path.join(self.seed_dir, 'my_seed_command.py')
       cmd = self.requirements['python'] + ' ' + path

       cmd_args = []

       args = vars(self.args)

       ## TODO: fill the following component params to seed params dictionary
       ## if the name of parameters of the seed are different than
       ## component parameter names.
       comp_seed_map = {
                        #e.g. 'component_param1': 'seedParam1',
                        #e.g. 'component_param2': 'seedParam2',
                        }

       for k, v in args.items():
           if v is None or v is False:
               continue

           ## TODO: uncomment the next line if you are using
           ## comp_seed_map dictionary.
           # k = comp_seed_map[k]

           cmd_args.append('--' + k)

           if isinstance(v, bool):
               continue
           if isinstance(v, str):
               v = repr(v)
           if isinstance(v, (list, tuple)):
               cmd_args.extend(v)
           else:
               cmd_args.extend([v])

       if chunk is not None:
           cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

       return cmd, cmd_args

.. topic:: Tip

   In the above example, ``python`` is a requirement for the component and should be added to the :ref:`component_reqs.py ` of the component.
   Also, the parameters ``foo`` and ``bar`` should be added to the :ref:`component_params.py `.

.. _abstract_class:

``ComponentAbstract`` class
***************************

This class comprises the following attributes and methods:

**Attributes:**

.. csv-table::
   :header: "Attribute", "Description"
   :widths: 20, 40

   "**args**", "the argparse namespace containing all the input arguments from the :ref:`Component_ui ` module"
   "**components_dir**", "path to the directory where the component exists"
   "**component_name**", "name of the component - *specific*"
   "**component_params**", ":ref:`Component_params ` module of the component"
   "**component_reqs**", ":ref:`Component_reqs ` module of the component"
   "**env_vars**", "see :ref:`Component_reqs `"
   "**memory**", "see :ref:`Component_reqs `"
   "**parallel**", "see :ref:`Component_reqs `"
   "**requirements**", "see :ref:`Component_reqs `"
   "**seed_dir**", "path to the directory where the seed exists. Most of the time it is ``/``"
   "**seed_version**", "version of the seed"
   "**version**", "version of the component - *specific*"

.. topic:: Tip

   *specific* means the attribute should be assigned a value when implementing a component.

**Methods:**

.. csv-table::
   :header: "Method", "Description"
   :widths: 20, 40

   "**__init__**", "initialize general attributes that each component must have"
   "**run**", "run the component command locally"
   "**focus**", "update the command and command arguments for each chunk - *virtual*"
   "**make_cmd**", "make the command used to run the seed of the component. This returns the same command that one would use to run the component as a stand alone program via command line - *virtual*"
   "**test**", "run the unit tests of the component - *virtual*"

.. topic:: Tip

   The class can be imported from the ``utils`` module of the ``kronos`` package:

   .. code-block:: python

      from kronos.utils import ComponentAbstract

.. _component_params:

Component_params
^^^^^^^^^^^^^^^^

This is a Python module that contains the following information:

- *input_files*: a dictionary whose keys are the input file parameters and whose values are the default values or proper :ref:`flags ` based on the :ref:`component UI `. For example:

  .. code-block:: python

     input_files={'samples':['tumour:__REQUIRED__',
                             'normal:__REQUIRED__',
                             'reference:__REQUIRED__',
                             'model:__REQUIRED__'
                             ],
                  'config':'some_default.cfg',
                  'positions_file':None
                  }

  .. note::

     This dictionary includes only parameters that expect input *files* or *directories*.

- *output_files*: a dictionary whose keys are the output file parameters and whose values are the default values or proper flags based on the component UI. For example:

  .. code-block:: python

     output_files = {'export_features':None,
                     'log_file':'mutationSeq_run.log',
                     'out':None
                     }

  .. note::

     This dictionary includes only parameters that expect output *files* or *directories*.


- *input_params*: a dictionary whose keys are the input non-file parameters and whose values are the default values or proper flags based on the component UI. For example:

  .. code-block:: python

     input_params = {'all':'__FLAG__',
                     'buffer_size':'2G',
                     'coverage':4,
                     'deep':'__FLAG__',
                     'interval':None,
                     'no_filter':'__FLAG__'
                     }

  .. note::

     All parameters that are not included in *input_files* or *output_files* should be listed in *input_params*.

Using the :ref:`make_component ` command, the following ``component_params.py`` file is generated:

.. code-block:: python

   """
   component_params.py

   Note the places you need to change to make it work for you.
   They are marked with keyword 'TODO'.
   """

   ## TODO: here goes the list of the input files. Use flags:
   ## '__REQUIRED__' to make it required
   ## '__FLAG__' to make it a flag or switch.
   input_files = {
   #               'input_file1' : '__REQUIRED__',
   #               'input_file2' : None
                 }

   ## TODO: here goes the list of the output files.
   output_files = {
   #               'output_file1' : '__REQUIRED__',
   #               'output_file1' : None
                  }

   ## TODO: here goes the list of the input parameters excluding input/output files.
   input_params = {
   #               'input_param1' : '__REQUIRED__',
   #               'input_param2' : '__FLAG__',
   #               'input_param3' : None
                  }

   ## TODO: here goes the return value of the component_seed.
   ## DO NOT USE, Not implemented yet!
   return_value = []

**Example:**

This is an example showing the content of a ``component_params.py`` file:

.. code-block:: python

   input_files = {'tumour':'__REQUIRED__',
                  'normal':'__REQUIRED__',
                  'reference':'__REQUIRED__',
                  'model':'__REQUIRED__',
                  'config':'metadata.config',
                  'positions_file':None
                  }

   output_files = {'export_features':None,
                   'log_file':'mutationSeq_run.log',
                   'out':'__REQUIRED__'
                   }

   input_params = {'all':'__FLAG__',
                   'buffer_size':'2G',
                   'coverage':4,
                   'deep':'__FLAG__',
                   'interval':None,
                   'no_filter':'__FLAG__',
                   'normalized':'__FLAG__',
                   'normal_variant':25,
                   'purity':70,
                   'mapq_threshold':20,
                   'baseq_threshold':10,
                   'indl_threshold':0.05,
                   'manifest':'__OPTIONAL__',
                   'single':'__FLAG__',
                   'threshold':0.5,
                   'tumour_variant':2,
                   'features_only':'__FLAG__',
                   'verbose':'__FLAG__',
                   'titan_mode':'__FLAG__'
                   }

.. _component_reqs:

Component_reqs
^^^^^^^^^^^^^^

This is a Python module that contains the following information:

- *env_vars*: a dictionary whose keys are the names of environment variables and whose values are the path/content to export. The values can be updated in the configuration file using :ref:`env_var ` in the run subsection. Therefore, it is recommended not to include the paths as values in this file and instead to use an empty list, ``[]``, or *None* as a value.
- *memory*: specifies the minimum memory required by the component to run properly on a cluster. The format is ``nG``, e.g. ``30G``.
- *parallel*: a boolean flag that specifies whether or not a component can run in parallel mode.
- *requirements*: a dictionary whose keys are usually the names of programs/software and whose values are *None* or the flag *__REQUIRED__*. The values are later updated by ``kronos`` using the content of the :ref:`__GENERAL__ ` section.
- *seed_version*: the version of the seed.
- *version*: the version of the component.

Using the :ref:`make_component ` command, the following ``component_reqs.py`` file is generated:

.. code-block:: python

   """
   component_reqs.py

   Note the places you need to change to make it work for you.
   They are marked with keyword 'TODO'.
   """

   ## TODO: here goes the list of the environment variables, if any,
   ## required to export for the component to function properly.
   env_vars = {
   #            'env_var1' : ['value1', 'value2'],
   #            'env_var2' : 'value3'
              }

   ## TODO: here goes the max amount of the memory required.
   memory = '5G'

   ## TODO: set this to True if the component is parallelizable.
   parallel = False

   ## TODO: here goes the list of the required software/apps
   ## called by the component.
   requirements = {
   #                'python': '__REQUIRED__',
                  }

   ## TODO: here goes the version of the component seed.
   seed_version = '0.99.0'

   ## TODO: here goes the version of the component itself.
   version = '0.99.0'

**Example:**

This is an example showing the content of a ``component_reqs.py`` file:

.. code-block:: python

   env_vars = {'LD_LIBRARY_PATH': []}
   memory = '4G'
   parallel = True
   requirements = {'java': '__REQUIRED__'}
   seed_version = 'version 3.2'
   version = 'v1.0.1'

.. _component_ui:

Component_ui
^^^^^^^^^^^^

This is a Python module that contains an `argparse `_ UI for the component.

Using the :ref:`make_component ` command, the following ``component_ui.py`` file is generated:

.. code-block:: python

   """
   component_ui.py

   Note the places you need to change to make it work for you.
   They are marked with keyword 'TODO'.
   """

   import argparse

   #==============================================================================
   # make a UI
   #==============================================================================
   ## TODO: pass the name of the component to the 'prog' parameter and a
   ## brief description of your component to the 'description' parameter.
   parser = argparse.ArgumentParser(prog='my_comp',
                                    description = """ brief description of your component goes here.""")

   ## TODO: create the list of input options here. Add as many as desired.
   parser.add_argument(
                       "-x", "--xparam",
                       default = None,
                       help= """ help message goes here. """)

   ## parse the argument parser.
   args, unknown = parser.parse_known_args()

**Example:**

This is an example showing the content of a ``component_ui.py`` file:

.. code-block:: python

   import sys
   import argparse

   #==============================================================================
   # make a UI
   #==============================================================================
   parser = argparse.ArgumentParser(prog='snpeff',
                                    description='''Genetic variant annotation and effect prediction toolbox.
                                    It annotates and predicts the effects of variants on genes (such as amino acid changes)''',
                                    epilog='''Input file: Default is STDIN''')

   # required arguments
   required_arguments = parser.add_argument_group("Required arguments")

   required_arguments.add_argument("--out",
                                   default=None,
                                   required=True,
                                   help='''specify the path/to/out.vcf to save output to a file''')

   # mandatory / positional arguments
   required_arguments.add_argument("genome_version",
                                   choices=['GRCh37.66'],
                                   help='''genomic build version''')

   required_arguments.add_argument("variants_file",
                                   help='''file containing variants''')

   # optional options
   optional_options = parser.add_argument_group("Options")

   optional_options.add_argument("-a", "--around",
                                 default=False,
                                 action="store_true",
                                 help='''Show N codons and amino acids around change (only in coding regions).
                                 Default is 0 codons.''')

   args, unknown = parser.parse_known_args()

.. warning::

   It is required to use ``parse_known_args`` instead of ``parse_args``.

.. _component_seed:

component_seed
^^^^^^^^^^^^^^

This is a directory within the component directory where all the source code of the actual program resides.

Examples
=========

Please refer to our GitHub `repositories `_ for more examples.
The repositories with the suffix ``_workflow`` are the pipelines and the rest are the components.
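
As a small self-contained illustration of the ``make_cmd`` and ``focus`` logic described in this document, the following sketch turns an argparse namespace into a list of command line arguments and then restricts the command to one chunk. It deliberately does not import ``kronos``. The names ``build_cmd_args``, the sample parameters, and the ``seedFoo`` mapping are hypothetical and chosen only for illustration. Unlike the generated template, this sketch falls back to the component parameter name when a key is missing from the map, converts values with ``str`` rather than ``repr``, and passes ``--interval`` and the chunk as two separate list items. Treat it as a rough model of the behaviour, not as the ``kronos`` implementation.

```python
import argparse


def build_cmd_args(args, comp_seed_map=None):
    """Turn an argparse namespace into a flat list of command line arguments."""
    cmd_args = []
    for key, value in vars(args).items():
        # Unset options and switched-off flags are left out entirely.
        if value is None or value is False:
            continue
        # Optionally translate the component parameter name to the seed's name
        # (assumption: unmapped names are passed through unchanged).
        if comp_seed_map:
            key = comp_seed_map.get(key, key)
        cmd_args.append('--' + key)
        # A flag that is switched on contributes only its name.
        if isinstance(value, bool):
            continue
        if isinstance(value, (list, tuple)):
            cmd_args.extend(str(item) for item in value)
        else:
            cmd_args.append(str(value))
    return cmd_args


def focus(cmd, cmd_args, chunk):
    """Restrict the command to one chunk via an assumed --interval option."""
    return cmd, cmd_args + ['--interval', str(chunk)]


# Hypothetical component arguments: one mapped option, one unset option,
# one flag, and one option that takes several values.
args = argparse.Namespace(foo='data1', bar=None, verbose=True,
                          samples=['a.bam', 'b.bam'])
cmd_args = build_cmd_args(args, comp_seed_map={'foo': 'seedFoo'})
cmd, cmd_args = focus('python my_seed_command.py', cmd_args, 'chr1')
print(cmd, ' '.join(cmd_args))
```

Keeping the argument translation in a plain function like this makes it easy to check how flags, unset values, and list values are handled before wiring the same logic into a component's ``make_cmd`` method.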