CARLA 2022

Nicole de Miranda Scherer

Title: Improving cloud native pipelines to benefit from available HPC infrastructure

Abstract: In the early days, bioinformatics pipelines were implemented as a series of Bash or Perl scripts, designed mainly to chain the execution of programs and handle their inputs and outputs. With the advance of Scientific Workflow Management Systems (SWfMS), pipelines became more robust and flexible, incorporating the concept of data provenance to better support reproducibility. Some of the most popular pipelines in genomics have been published by the Broad Institute's Data Sciences Platform using the Cromwell workflow engine and the Workflow Description Language (WDL). These pipelines are optimized for execution in a cloud environment, particularly on their Terra.bio platform. Adapting a cloud-native implementation to an HPC environment means dealing with the configurations that differ from cloud-ready systems and with the characteristics specific to each type of environment. We have to take into account that an HPC cluster is a shared system, with persistent storage and limited resources. The typical user does not have the administrative privileges needed to run the same container system used in the cloud; execution files will not automatically vanish from storage once the tasks have finished; runtime attributes specified in task definitions may not be recognized in a different environment; and execution engines must interact with a workload manager. Nevertheless, once we have dealt with the challenges of the first workflow, we have built up a toolbox that allows us to easily adapt subsequent workflows to the HPC environment.

Bio: Dr. Nicole Scherer is a biologist who manages the high-performance computing platform for bioinformatics at the Brazilian National Cancer Institute (INCA - Rio de Janeiro). She completed her undergraduate studies in biological sciences at the Universidade Federal do Rio Grande do Sul (UFRGS - Porto Alegre, Brazil), followed by a master’s degree in genetics at the same institution, with a project in plant evolution carried out entirely in silico. During her graduate studies in bioinformatics (Faculty of Mathematics and Natural Sciences - Heinrich-Heine University, Düsseldorf), she learned to program in Perl and R and acquired the GNU/Linux skills needed to use bioinformatics software. In 2011 she joined INCA as a technologist in bioinformatics, providing technical and scientific support to the institute's research groups. Since the acquisition of the first HPC cluster in 2013, the bioinformatics platform has grown to over 50 users and serves almost 20 research groups. Nicole is also engaged in education and bioinformatics outreach. She is currently a member of the board of directors of the Brazilian Association of Bioinformatics and Computational Biology (AB3C) and serves on the organizing committee of the X-Meeting, the largest conference for bioinformatics in Brazil.

Contact: nscherer@inca.gov.br