Tutorials Day | September 22, 2025
Registration opens at 8:15 AM in the Grand Jamaica Suite lobby.
Track 1 Venue: Grand Jamaica (Montego) Suite | Track 2 Venue: Grand Jamaica (Negril) Suite | Track 3 Venue: Grand Jamaica (Port Antonio) Suite
— | 9:00 AM: A Programming Introduction to HPC | 9:00 AM: Simulating quantum algorithms with Q-Team
11:00 – 11:30 AM: Coffee Break
— | A Programming Introduction to HPC (continued) | Simulating quantum algorithms with Q-Team (continued)
1:00 – 2:00 PM: Lunch
— | 2:00 PM: A Programming Introduction to HPC (continued) | 2:00 PM: Distributed Deep Learning: A Tutorial on Distributed Training Techniques for Large Deep Learning Models
4:00 – 4:30 PM: Coffee Break
— | A Programming Introduction to HPC (continued) | Distributed Deep Learning: A Tutorial on Distributed Training Techniques for Large Deep Learning Models (continued)
Workshops Day | September 23, 2025
Registration opens at 8:30 AM in the Grand Jamaica Suite lobby.
Location: Grand Jamaica Suite
Dr. Vanesa M. Tennant Williams, Moderator
CARLA 2025 Welcome Message
Dr. Kevin A. Brown
General Co-Chair
CARLA 2025
Remarks
Dr. Charah T. Watson
Executive Director
Scientific Research Council
Prof. Tannecia Stephenson
Deputy Dean, Faculty of Science and Technology
Co-Director of the Climate Studies Group, Mona (CSGM)
University of the West Indies, Mona
Mrs. Anika C. D. Shuttleworth
Chief Information Officer
JAMICTA - ICT Authority
Dr. Carlos Jaime Barrios Hernandez
General Chair
SCALAC
Abstract. NASA’s missions span human and robotic space exploration, ground-breaking aeronautics research, and Earth and space sciences. This talk will provide a broad overview of numerous HPC and related technologies that NASA develops, adapts, and implements for its wide spectrum of programs and projects ranging from Earth to deep space.
Biography. Dr. Rupak Biswas is currently the Director of Exploration Technology at NASA Ames Research Center, Moffett Field, Calif., and has held this Senior Executive Service (SES) position since January 2016. In this role, he is in charge of planning, directing, and coordinating the technology development and operational activities of the organization that comprises advanced supercomputing, human systems integration, intelligent systems, and entry systems technology. The directorate consists of approximately 700 employees with an annual budget of $160 million, and includes two of NASA’s critical and consolidated infrastructures: the arcjet testing facility and the supercomputing facility. He is also the Manager of the High End Computing Capability Project that provides a full range of advanced computational resources and services to numerous NASA programs. In addition, he leads the emerging quantum computing effort for NASA. Dr. Biswas received his Ph.D. in Computer Science from Rensselaer in 1991, and has been at NASA ever since. During this time, he has received several agency awards, including the Exceptional Achievement Medal and the Outstanding Leadership Medal. He is an internationally recognized expert in high performance computing and has published more than 150 technical papers, received many Best Paper awards, edited several journal special issues, and given numerous lectures around the world.
What you "REALLY" need to know about #AI (from AI to AgenticAI)
by Francisco Aguirre (Dell Technologies) & Pedro Mario Cruz e Silva (NVIDIA)
Francisco Aguirre,
LATAM NVIDIA Solutions Senior Principal, Dell Technologies
Francisco Aguirre is an experienced technology leader with more than 30 years in the IT industry, specializing in Artificial Intelligence, High-Performance Computing, and emerging technologies. He currently leads NVIDIA solutions for Dell Technologies in Latin America, helping organizations harness the power of accelerated computing to drive innovation and competitive advantage.
Throughout his career, Francisco has advised clients across key industries—including finance, telecommunications, retail, airlines, and education—on how to adopt and scale transformative technologies.
His expertise spans data analytics, business intelligence, and big data, as well as modern AI deployments leveraging NVIDIA platforms, GPU-based architectures, and hybrid cloud strategies.
Francisco is recognized as a trusted advisor, speaker, and thought leader. He has delivered keynotes and technical sessions at major industry events such as Dell Technologies World, Dell Technologies Forum, and Mexico Business Forum.
His presentations focus on making cutting-edge concepts like Generative AI, Retrieval-Augmented Generation, Agentic AI, and Quantum Computing accessible to both business and technical audiences.
He holds a degree in Systems and Computer Science Engineering from La Salle University and a master’s degree in Customer Relationship Management from Duke University. Passionate about innovation, Francisco continues to drive conversations at the intersection of technology, business, and culture.
Pedro Mário Cruz e Silva,
Senior Solutions Architect | NVIDIA Latin America
Pedro Mário Cruz e Silva received his BSc (1995) and MSc (1998) from the Federal University of Pernambuco (UFPE), and his DSc in 2004 from PUC-Rio. He created the Computational Geophysics Group at PUC-Rio, where he worked for 15 years as Manager; during this period he was responsible for several software development and R&D projects for geophysics with a strong focus on innovation. He also completed an MBA in 2015 at the Getúlio Vargas Foundation (FGV/RJ). He is currently Senior Solutions Architect for Higher Education and Research for the Latin America region.
Served in the Grand Jamaica Suite pre-function area
11:40 AM - 1:00 PM
Chair: Esteban Mocskos
- Machine Learning for Predicting Job States and CPU Power on a Supercomputer
- Performance and Energy Consumption Prediction of Scientific Workflows using Machine Learning
- Profiling a task-based molecular dynamics application with a data science approach
- Investigating the Impact of DVFS on the Energy Efficiency of AI Workloads on GPUs
Authors. Dylan Benavides Castillo (Costa Rica National High Technology Center - CENAT); Fabricio Quirós Corella (Costa Rica National High Technology Center - CENAT); Esteban Meneses (Costa Rica National High Technology Center - CENAT)
Abstract. Efficient resource management in high-performance computing (HPC) is essential for optimizing costs, reducing energy consumption, and improving system productivity. However, job variability and failures introduce uncertainties that complicate scheduling and resource allocation. Accurately predicting job failures and estimating energy consumption can enhance planning and operational efficiency. This study analyzes data from the Simple Linux Utility for Resource Management (SLURM) on the Kabré supercomputer at Costa Rica’s National High Technology Center (CeNAT). After selecting and preprocessing relevant variables, a dataset was created to train a two-stage machine learning model comprising a binary classifier and a regression model. Using 10-fold cross-validation, multiple models were evaluated, with Random Forest emerging as the best performer in both stages. The classification model was assessed using the confusion matrix and ROC curve, while the regression model was evaluated through residual analysis and metrics such as Root Mean Square Error (RMSE) and the Coefficient of Determination (R^2). This approach can support users and administrators by improving job scheduling decisions and reducing energy waste in HPC systems.
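The paper's code is not reproduced here, but the two-stage idea (a failure classifier followed by an energy regressor, both Random Forests, evaluated with 10-fold cross-validation) can be sketched as follows; the column names (elapsed, req_cpus, req_mem, failed, energy_kj) are hypothetical placeholders rather than actual SLURM field names.

```python
# Minimal sketch of a two-stage model: a binary classifier for job failure
# followed by a regressor for energy consumption. Column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

jobs = pd.read_csv("slurm_jobs.csv")               # preprocessed accounting data
features = jobs[["elapsed", "req_cpus", "req_mem"]]

# Stage 1: will the job fail?
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, features, jobs["failed"], cv=10).mean())

# Stage 2: for completed jobs, estimate energy consumption.
completed = jobs[jobs["failed"] == 0]
reg = RandomForestRegressor(n_estimators=200, random_state=0)
print(cross_val_score(reg, completed[["elapsed", "req_cpus", "req_mem"]],
                      completed["energy_kj"], cv=10, scoring="r2").mean())
```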
Authors. Felipe Barbosa (Federal University of Pará - UFPA); Josivaldo de Souza Araújo (Federal University of Pará - UFPA); Marcos Amarís (Federal University of Pará - UFPA); Erick Damasceno (Federal University of Pará - UFPA); Fellipe Queiroz (Federal University of Pará - UFPA); Josiany Brito Guimarães (Federal University of Pará - UFPA); Daniel Cordeiro (University of São Paulo - USP)
Abstract. In High Performance Computing (HPC), large-scale scientific workflows are essential for modern discoveries but lead to significant energy consumption. This work explores predictive models to estimate both energy consumption and performance, to support sustainable computing in HPC environments. We used WfCommons to generate workflows, Wrench to simulate the supercomputing environments, and Scikit-learn to implement machine learning algorithms. Regression models, including ensemble techniques, were developed and evaluated using widely adopted scientific workflows such as BLAST, Montage, and Epigenomics. For training, features included IO time (in seconds) and the amount of bytes read and written. In energy consumption prediction, the Gradient Boosting Regressor (GBR) achieved high R^2 scores, such as 0.8556 for Epigenomics and 0.7143 for BLAST. For performance prediction, GBR also showed superior accuracy, with MAE and MAPE as low as 0.0257 and 0.0068, respectively, in the BLAST workflow. These results confirm the effectiveness of ensemble models in energy efficiency and performance, contributing to sustainable scientific computing.
Authors. Christian Asch (Costa Rica National High Technology Center - CENAT); Lucas Mello Schnorr (Federal University of Rio Grande do Sul - UFRGS); Esteban Meneses (Costa Rica National High Technology Center - CENAT)
Abstract. Charm++ is a parallel programming framework based on task-driven execution and global object references. It has been used successfully in various high-performance computing (HPC) applications, including the molecular dynamics simulator NAMD. While Charm++ includes built-in support for performance tracing and visualization through its Projections tool, the existing system offers limited extensibility and has no support for modern data science workflows. This work presents a new visualization and analysis pipeline for Charm++ trace data that emphasizes modularity, openness, and composability. Our toolchain leverages standard scripting languages and data formats - producing output in CSV and Parquet formats - to facilitate integration with data analysis ecosystems. We demonstrate the effectiveness of this approach using LeanMD, a proxy application derived from NAMD, and highlight specific types of custom visualizations that are difficult to achieve with Projections. Our system enables custom visualizations and streamlined analysis of chare-level execution behavior, offering researchers and tool developers improved capabilities for understanding program performance and identifying load imbalance. We discuss the architecture of our tool, its application to real-world traces, and potential extensions for other task-based frameworks.
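As an illustration of the kind of composable, data-science-style analysis the abstract describes, the sketch below loads trace events from Parquet with pandas and derives a simple per-PE load-imbalance measure; the column names (pe, chare, start, end) are hypothetical and do not reflect the authors' actual schema.

```python
# Sketch of an analysis step enabled once Charm++ trace events are exported to
# Parquet. Column names are hypothetical placeholders for the real schema.
import pandas as pd

events = pd.read_parquet("leanmd_trace.parquet")
events["duration"] = events["end"] - events["start"]

# Busy time per processing element -> simple load-imbalance metric.
busy = events.groupby("pe")["duration"].sum()
imbalance = busy.max() / busy.mean()
print(f"load imbalance (max/mean busy time): {imbalance:.2f}")

# Time spent per chare type, ready for plotting with any standard library.
print(events.groupby("chare")["duration"].sum().sort_values(ascending=False))
```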
Authors. Arthur Lorenzon (Federal University of Rio Grande do Sul - UFRGS); Thiago Goncalves (Federal University of Rio Grande do Sul - UFRGS)
Abstract. The increasing scale of artificial intelligence (AI) models has led to unsustainable energy consumption in GPU-based systems, creating an important need for more efficient computing strategies. While Dynamic Voltage and Frequency Scaling (DVFS) can be an efficient technique, the combined impact of GPU core and memory frequencies is often not systematically evaluated. This paper presents a comprehensive analysis of how tuning both GPU core and memory frequencies affects performance, energy consumption, and the Energy-Delay Product (EDP) for AI workloads. We evaluate seven benchmarks with diverse computational demands on an AMD Radeon RX 7700XT GPU across 12 distinct frequency configurations. Our results reveal that memory-bound applications benefit from increased memory frequency, reducing execution time by up to 80.7%, while compute-bound applications benefit more from higher core frequencies. In addition, by selecting appropriate operating frequencies, it is possible to reduce the EDP by up to 98.3% compared to the worst-case scenario, highlighting the possible energy efficiency gains of choosing a balanced configuration that optimizes the energy-performance trade-off.
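For readers unfamiliar with the Energy-Delay Product metric used above, the toy sketch below shows how EDP ranks frequency configurations; the energy and runtime numbers are invented, not measurements from the paper.

```python
# Tiny illustration of the Energy-Delay Product (EDP) used to compare frequency
# configurations. Values are made up; real ones come from power counters and timers.
def edp(energy_joules: float, runtime_seconds: float) -> float:
    """EDP = energy * delay; lower is better."""
    return energy_joules * runtime_seconds

configs = {
    "low core / low mem":   {"energy": 950.0,  "runtime": 14.2},
    "high core / low mem":  {"energy": 1100.0, "runtime": 9.8},
    "high core / high mem": {"energy": 1250.0, "runtime": 7.1},
}
for name, c in configs.items():
    print(f"{name:22s} EDP = {edp(c['energy'], c['runtime']):8.1f} J*s")
```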
Ricardo Baeza-Yates
KTH Royal Institute of Technology, Sweden
Universitat Pompeu Fabra, Barcelona
Universidad de Chile
Abstract. Machine learning (ML), particularly deep learning, is being used everywhere. However, it is not always used well, ethically, or scientifically. In this talk we first do a deep dive into the limitations of supervised ML and of data, its key component. We cover small data, datification, bias, predictive optimization issues, evaluating success instead of harm, and pseudoscience, among other problems. The second part is about our own limitations in using ML, including different types of human incompetence: cognitive biases, unethical applications, lack of administrative competence, misinformation, and the impact on mental health. In the final part we discuss regulation of the use of AI and responsible AI principles, which can mitigate the problems outlined above.
Biography. Ricardo Baeza-Yates is a part-time WASP Professor at KTH Royal Institute of Technology in Stockholm, as well as a part-time professor in the Department of Engineering of Universitat Pompeu Fabra in Barcelona and the Department of Computer Science of the University of Chile in Santiago. Before that, he was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from 2006 to 2016. He is co-author of the best-selling textbook Modern Information Retrieval, published by Addison-Wesley in 1999 and 2011 (2nd ed.), which won the ASIST 2012 Book of the Year award. In 2009 he was named ACM Fellow and in 2011 IEEE Fellow. He has won national scientific awards in Chile (2024) and Spain (2018), among other accolades and distinctions. He obtained a Ph.D. in CS from the University of Waterloo, Canada, and his areas of expertise are responsible AI, web search and data mining, plus data science and algorithms in general.
Moderator: Addison Snell (Intersect360)
4:35 PM - 5:55 PM
Chairs: Kyle Felker, Tadeu Gomes
- Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads
- Optimizing the Energy-Efficiency of QMCPACK on Aurora Supercomputer via GPU Sharing
- Good Sustainability Practices for Data Center: A Systematic Literature Review
- Parallel/distributed computing for optimizing investment planning in electricity markets
Authors. Patrick Zojer (University of Kassel); Jonas Posner (University of Kassel); Taylan Özden (Technical University of Darmstadt)
Abstract. Optimizing resource utilization in high-performance computing (HPC) clusters is essential for maximizing both system efficiency and user satisfaction. However, traditional rigid job scheduling often results in underutilized resources and increased job waiting times. This work evaluates the benefits of resource elasticity, where the job scheduler dynamically adjusts the resource allocation of malleable jobs at runtime. Using real workload traces from the Cori, Eagle, and Theta supercomputers, we simulate varying proportions (0–100%) of malleable jobs with the ElastiSim software. We evaluate five job scheduling strategies, including a novel one that maintains malleable jobs at their preferred resource allocation when possible. Results show that, compared to fully rigid workloads, malleable jobs yield significant improvements across all key metrics. Considering the best-performing scheduling strategy for each supercomputer, job turnaround times decrease by 37–67%, job makespan by 16–65%, job wait times by 73–99%, and node utilization improves by 5–52%. Although improvements vary, gains remain substantial even at 20% malleable jobs. This work highlights important correlations between workload characteristics (e.g., job runtimes and node requirements), malleability proportions, and scheduling strategies. These findings confirm the potential of malleability to address inefficiencies in current HPC practices and demonstrate that even limited adoption can provide substantial advantages, encouraging its integration into HPC resource management.
Authors. Matheus Costa (Federal University of Rio Grande do Sul - UFRGS); Philippe Navaux (Federal University of Rio Grande do Sul - UFRGS); Silvio Rizzi (Argonne National Laboratory); Arthur Lorenzon (Federal University of Rio Grande do Sul - UFRGS)
Abstract. As high-performance computing advances toward the exascale, energy efficiency has become a primary concern. However, complex scientific applications often underutilize accelerator hardware, such as GPUs, leaving valuable resources idle. In this scenario, this paper investigates GPU sharing as a strategy to improve resource utilization and energy efficiency for the QMCPACK application on the Aurora supercomputer. We evaluate Intel's Multiple Compute Command Streamers (CCS) technology across a scale of 1 to 16 nodes on Aurora, comparing a baseline single-rank-per-GPU-tile configuration (1-CCS) against multi-rank setups with two (2-CCS) and four (4-CCS) ranks per tile. Our results demonstrate that GPU sharing via the 2-CCS mode yields significant performance improvements regarding the Figure of Merit (FOM) by an average of 17% and enhances energy efficiency (FOMe) by 32% compared to the baseline configuration that uses 1-CCS only. We also show that the 4-CCS configuration suffers from increased MPI communication overhead and lower vectorization efficiency, being worse for performance and energy than running with only 1-CCS. On the other hand, 2-CCS achieves a better balance between higher compute unit engagement and reduced memory system stalls. We also show that exploiting GPU sharing can be an effective strategy for boosting both throughput and sustainability on modern HPC systems, but over-partitioning can introduce overheads that cancel out the benefits.
Authors. Josiany Brito Guimarães (Federal University of Pará - UFPA); Felipe Barbosa (Federal University of Pará - UFPA); Erick Damasceno (Federal University of Pará - UFPA); Fellipe Queiroz (Federal University of Pará - UFPA); Marcos Amarís (Federal University of Pará - UFPA)
Abstract. The search for sustainable practices and energy efficiency in data centers has become a top priority in today’s world, driven by the rapid growth in internet use and online traffic. This highlights the urgent need for research that focuses on practical strategies to reduce the environmental impact of these vital facilities. This article reviews 10 relevant studies that examine the primary methods used to enhance energy efficiency in data centers, including emerging technologies, resource management techniques, and sustainable policies. Additionally, the paper discusses the environmental impacts of data centers, including high energy consumption, greenhouse gas emissions, and intensive use of natural resources. The findings indicate that implementing good sustainability practices can bring several benefits, such as lowering operating costs and reducing the carbon footprint, while maintaining service quality.
Authors. Santiago Freire (Universidad de la República); Sergio Nesmachnow (Universidad de la República); Pedro Moreno (Universidad Autonoma del Estado de Morelos - UAEM)
Abstract. This article presents a parallel and distributed computing strategy for optimizing investment planning in electricity markets. An implementation is developed for a stochastic dynamic programming model used by the Uruguayan electrical company to determine investment strategies in generation assets. The sequential execution of thousands of solver instances results in high computing times, rendering the process inefficient for large-scale scenarios. A distributed architecture based on Message Passing Interface is proposed to enable parallel execution across multiple computing resources. The solution includes modular components organized in a layered architecture, load balancing and fault recovery features. Performance results indicate that the parallel implementation allows addressing complex scenarios with high computing demands. Significant reductions in execution time were obtained, with high speedup values (up to 98x) and efficiency up to 0.86 for a full-scale scenario executed on distributed nodes of a high-performance computing cluster.
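The sketch below illustrates, with mpi4py, the general master-worker pattern such an MPI-based distribution can follow: rank 0 hands solver scenarios to workers as they become free. The run_solver function and the scenario list are placeholders, not the authors' implementation.

```python
# Master-worker sketch (run with: mpirun -n 4 python script.py).
# Assumes more scenarios than worker ranks.
from mpi4py import MPI

def run_solver(scenario):
    # Placeholder for launching one solver instance on this scenario.
    return sum(scenario)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    scenarios = [[i, i + 1] for i in range(100)]
    n, results, status = len(scenarios), [], MPI.Status()
    for worker in range(1, size):                  # seed every worker with one task
        comm.send(scenarios.pop(), dest=worker)
    while len(results) < n:
        results.append(comm.recv(source=MPI.ANY_SOURCE, status=status))
        nxt = scenarios.pop() if scenarios else None   # None tells the worker to stop
        comm.send(nxt, dest=status.Get_source())
    print("collected", len(results), "results")
else:
    while True:
        task = comm.recv(source=0)
        if task is None:
            break
        comm.send(run_solver(task), dest=0)
```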
Location: Talk of the Town
Registration opens at 8:30 AM in the Talk of the Town lobby.
Abstract. We live in interesting times where new ideas in computing and technology emerge at ever-increasing rates in AI, edge computing and IoT, and programmable networking, to name just a few. These innovations open up unprecedented opportunities for all kinds of scientific progress, from biotechnology to engineering. Given their critical role, the question arises: who creates opportunities for computer science? How can we create a scientific instrument where computer science ideas can be tried, tested, and adopted or discarded? What would such a scientific instrument look like, and would it engage with its community? How would such an instrument evolve to follow the evolution of science? And lastly, how would it negotiate the transition from innovation to mainstream adoption?
In this talk, I will describe how to build this type of scientific instrument supporting the exploration of new ideas in the cyberinfrastructure space. I will share the insights, design strategy, and lessons learned from building and operating the Chameleon computer science research platform over the last decade. We will take the journey from a base cloud testbed design and track its evolution through diversifying its hardware to support innovative architectures (e.g., Fugaku nodes), accelerators, and disaggregated hardware (Liqid, GigaIO), and ultimately taking it from the datacenter into the field by introducing support for edge hardware based on single-board computers (Raspberry Pis and NVIDIA Nanos). We will see how the platform evolved to support emergent ideas coming from its now 13,000-strong user community at a reasonable cost, allowing it to support a significant scientific output of 800+ publications. Lastly, I will share stories of research and education projects in the edge-to-cloud continuum and discuss their impact both on science and on scientific sharing through reproducible digital artifacts.
Biography. Kate Keahey is one of the pioneers of infrastructure cloud computing. She created the Nimbus project, recognized as the first open-source Infrastructure-as-a-Service implementation, and continues to work on research aligning cloud computing concepts with the needs of scientific datacenters and applications. To facilitate such research for the community at large, Kate leads the Chameleon project, providing a deeply reconfigurable, large-scale, and open experimental platform for computer science research. To foster the recognition of contributions to science made by software projects, Kate co-founded the SoftwareX journal, a new format designed to publish software contributions. Kate is a Scientist at Argonne National Laboratory and a Senior Scientist at the University of Chicago Consortium for Advanced Science and Engineering (UChicago CASE).
AI for Science and Engineering (Lenovo - Nvidia)
Ulysses Darly Galasso
Sales Manager, HPC | Lenovo ISG LA
Pedro Mario Cruz e Silva pcruzesilva@nvidia.com
Senior Solutions Architect | NVIDIA Latin America
Accelerating HPC and AI with Lenovo Servers Powered by Intel® Xeon® 6 (Intel - Lenovo)
Tarcisio Alves
Industry Technology Sales Specialist DC/AI | Intel Brazil.
Abstract. Utilization of AMD EPYC CPUs for parallel processing in HPC environments, supercomputing applications, and cutting-edge artificial intelligence analytics.
Biography. Miguel Tiempos has a long track record in HPC, first at Cray Supercomputers and later, after its acquisition, at HPE. He has worked on many deployments of high-end supercomputer systems for all types of workloads, serving research and higher-education organizations as well as national-security entities across Latin America. He has deep knowledge of HPC workload scheduling, hybrid ecosystems for ML training, container-based deployments, and automation of scalable infrastructure solutions for HPC and AI.
Enabling Accessible, Secure, and Scalable AI on HPC Infrastructure
Abstract. The convergence of High-Performance Computing (HPC) and Artificial Intelligence (AI), particularly large language models (LLMs), is reshaping the computational landscape for both scientific and enterprise domains. However, significant technical and usability barriers hinder the seamless integration of HPC resources with AI/LLM workflows. This paper presents a comprehensive, modular framework architecture that bridges these domains, addressing core challenges in usability, automation, security, and extensibility. The proposed framework integrates portal/API layers, orchestration engines, data fabrics, containerization, workflow automation, monitoring/reporting, and robust security. We detail its design principles, highlight its modularity and support for hybrid deployments and federated learning, and analyze real-world use cases in academic research, enterprise AI, and scientific simulation. Our evaluation demonstrates enhanced accessibility, scalability, and reproducibility, positioning the framework as a foundation for future AI-HPC integration.
Biography. Lincoln V. Walters, CEO, LVW ELECTRONICS SYSTEMS INC.
Served in the Talk of the Town
TBA
12:35 PM - 1:15 PM
Chair: Harold Castro
- Driving Computational Efficiency in Large-Scale Platforms using HPC Technologies
- A Scalable and Reproducible Parsl Framework for Molecular Evolutionary Analyses on HPC Systems
Authors. Alexander Martínez Méndez (Universidad Industrial de Santander - UIS); Antonio Rubio Montero (Centre for Energy, Environmental and Technological Research - CIEMAT); Carlos Jaime Barrios Hernandez (Universidad Industrial de Santander - UIS); Hernán Asorey (piensas.xyz); Rafael Mayo-García (Centre for Energy, Environmental and Technological Research - CIEMAT); Luis Alberto Nuñez Villavicencio (Universidad Industrial de Santander - UIS)
Abstract. The Latin American Giant Observatory (LAGO) project utilizes extensive High-Performance Computing (HPC) resources for complex astroparticle physics simulations, making resource efficiency critical for scientific productivity and sustainability. This article presents a detailed analysis focused on quantifying and improving HPC resource utilization efficiency specifically within the LAGO computational environment. The core objective is to understand how LAGO's distinct computational workloads—characterized by a prevalent coarse-grained, task-parallel execution model—consume resources in practice. To achieve this, we analyze historical job accounting data from the EGI FedCloud platform, identifying primary workload categories (Monte Carlo simulations, data processing, user analysis/testing) and evaluating their performance using key efficiency metrics (CPU utilization, walltime utilization, and I/O patterns). Our analysis reveals significant patterns, including high CPU efficiency within individual simulation tasks contrasted with the distorting impact of short test jobs on aggregate metrics. This work pinpoints specific inefficiencies and provides data-driven insights into LAGO's HPC usage. The findings directly inform recommendations for optimizing resource requests, refining workflow management strategies, and guiding future efforts to enhance computational throughput, ultimately maximizing the scientific return from LAGO's HPC investments.
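To make the efficiency metrics concrete, the sketch below computes CPU and walltime efficiency from a hypothetical job-accounting export and separates out short test jobs, which the study found distort aggregate statistics; the field names are illustrative, not the LAGO/EGI dataset schema.

```python
# Illustrative efficiency metrics over job accounting records (hypothetical CSV).
import pandas as pd

acct = pd.read_csv("accounting.csv")   # one row per finished job

# CPU efficiency: CPU time actually used vs. CPU time allocated.
acct["cpu_eff"] = acct["TotalCPUSeconds"] / (acct["ElapsedSeconds"] * acct["AllocCPUs"])

# Walltime utilization: elapsed time vs. requested time limit.
acct["wall_eff"] = acct["ElapsedSeconds"] / acct["TimelimitSeconds"]

# Short test jobs can distort aggregate metrics, so report them separately.
short = acct["ElapsedSeconds"] < 300
print("mean CPU efficiency, all jobs:     ", round(acct["cpu_eff"].mean(), 3))
print("mean CPU efficiency, jobs >= 5 min:", round(acct.loc[~short, "cpu_eff"].mean(), 3))
```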
Authors. Rafael Terra (National Laboratory for Scientific Computing - LNCC); Hugo Oliveira (National Laboratory for Scientific Computing - LNCC); Daniel Janies (University of North Carolina at Charlotte); Hiago Rocha (National Laboratory for Scientific Computing - LNCC); Diego Carvalho (Federal Center for Technological Education Celso Suckow da Fonseca - CEFET/RJ); Carla Osthoff (National Laboratory for Scientific Computing - LNCC); Kary Ocaña (National Laboratory for Scientific Computing - LNCC)
Abstract. Codon-based model testing is fundamental to molecular evolution studies. The growing complexity of these analyses -- driven by computationally intensive likelihood estimations, memory-demanding datasets, and the need to scale across hundreds or thousands of genes -- necessitates efficient use of high-performance computing resources. To address this need, we present HighSPA, a scalable and reproducible framework that integrates two widely used tools for evolutionary analysis -- CodeML and HyPhy -- into parallel workflows using the Parsl library. We validated HighSPA using DENV genomes from Brazil (serotypes 1–4), applying six codon substitution models. When using the CodeML workflow, HighSPA identified high-confidence positively selected sites (PSS) with serotype-specific patterns. In contrast, the HyPhy workflow detected fewer PSS, likely due to its more conservative inference approach. In terms of performance, HighSPA-HyPhy significantly reduced makespan and increased throughput -- by an average of 87x and 89x, respectively -- compared to the sequential execution of the analyses. These results support the presence of adaptive evolution in key genes such as E, NS3, and NS5, and demonstrate HighSPA’s effectiveness for large-scale evolutionary analysis.
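The fragment below sketches how Parsl can fan out per-gene analyses as independent tasks, in the spirit of the workflow described above; the codeml command line, control-file names, and local-threads configuration are illustrative placeholders rather than the actual HighSPA code.

```python
# Minimal Parsl sketch: one bash_app task per gene, scheduled in parallel.
import parsl
from parsl import bash_app
from parsl.configs.local_threads import config

parsl.load(config)   # on an HPC system this would be an executor/SLURM config

@bash_app
def run_codeml(ctl_file, stdout="codeml.out", stderr="codeml.err"):
    # Each invocation becomes an independent task Parsl can schedule.
    return f"codeml {ctl_file}"

# Launch one task per gene; futures resolve as tasks complete.
futures = [run_codeml(f"gene_{i}.ctl",
                      stdout=f"gene_{i}.out",
                      stderr=f"gene_{i}.err") for i in range(4)]
for f in futures:
    f.result()   # block until the corresponding analysis finishes
```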
Abstract. To understand the scaling behavior of HPC applications, developers often use performance models. A performance model is a formula that expresses a critical performance metric, such as runtime, as a function of one or more execution parameters, such as core count and input size. Performance models offer quick insights on a very high level of abstraction, including predictions of future behavior. Given the complexity of today’s applications, which often combine several sophisticated algorithms, creating performance models manually is extremely laborious. Empirical performance modeling, the process of learning such models from performance data, offers a convenient alternative but comes with its own set of challenges. The two most prominent ones are noise and the cost of the experiments needed to generate the underlying data. In this talk, we will review the state of the art in empirical performance modeling and investigate how we can employ machine learning and other strategies to improve the quality and lower the cost of the resulting models.
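As a toy example of empirical performance modeling, the sketch below fits a simple scaling model t(p) = a + b * p^c to synthetic runtime measurements and extrapolates to a larger core count; production tools such as Extra-P search much richer model spaces and treat noise more carefully.

```python
# Toy empirical performance model: fit runtime as a function of core count.
import numpy as np
from scipy.optimize import curve_fit

cores    = np.array([1, 2, 4, 8, 16, 32], dtype=float)
runtimes = np.array([100.0, 52.0, 27.5, 15.0, 8.9, 5.6])   # seconds (synthetic)

def model(p, a, b, c):
    return a + b * p**c

params, _ = curve_fit(model, cores, runtimes, p0=(1.0, 100.0, -1.0))
a, b, c = params
print(f"t(p) = {a:.2f} + {b:.2f} * p^{c:.2f}")
print(f"predicted runtime on 128 cores: {model(128.0, *params):.2f} s")
```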
Biography. Felix Wolf is a full professor at the Department of Computer Science of the Technical University of Darmstadt in Germany, where he leads the Laboratory for Parallel Programming. He works on methods, tools, and algorithms that support developing and deploying parallel software systems in various life-cycle stages. Wolf received his Ph.D. degree from RWTH Aachen University in 2003. After working more than two years as a postdoc at the Innovative Computing Laboratory of the University of Tennessee, he was appointed research group leader at Juelich Supercomputing Centre. Between 2009 and 2015, he was head of the Laboratory for Parallel Programming at the German Research School for Simulation Sciences in Aachen and a full professor at RWTH Aachen University. Wolf has made major contributions to several open-source performance tools for parallel programs, including Scalasca, Score-P, and Extra-P. Moreover, he has initiated the Virtual Institute – High Productivity Supercomputing, an international initiative of HPC programming-tool builders aimed at enhancing, integrating, and deploying their products. He has published over 150 refereed articles on parallel computing, several of which have received awards.
Genaro Costa
HPC & AI Solution Architect | EVIDEN
Morris Skupinsky
HPC & AI Solution Architect | DDN
Served in the Legacy Suite
4:20 PM - 6:00 PM | Legacy Suite
- Comparative Performance Analysis of DNA Sequence Encoding Methods for Machine Learning-Based Bacterial Classification, Diego Santibanez Oyarce (Universidad Tecnológica Metropolitana, Santiago, Chile), Jorge Vergara-Quezada (Universidad Tecnológica Metropolitana, Santiago, Chile), Ana Moya-Beltrán (Universidad Tecnológica Metropolitana, Santiago, Chile)
- Application of Language Models for the Functional Annotation of Conserved Domains in Biological Data, Hugo Osses Prado (Universidad Tecnológica Metropolitana, Santiago, Chile), Raúl Caulier-Cisterna (Universidad Tecnológica Metropolitana, Santiago, Chile), Ana Moya-Beltrán (Universidad Tecnológica Metropolitana, Santiago, Chile)
- Optimizing path analysis in multi-perspective graphs: A study on the migration from NetworkX to graph-tool, Welber P. Ferreira (LNCC, Brazil), Antônio T. A. Gomes (LNCC, Brazil)
- LUAD-SynthNet: Generative Adversarial Networks for Synthetic Single-Cell Transcriptomics in Lung Adenocarcinoma, Joaquín Araya-Bustos (Universidad Tecnológica Metropolitana, Chile), Welinton Barrera-Mondaca (Universidad Tecnológica Metropolitana, Chile), Renato Álvarez-Ramos (Universidad Tecnológica Metropolitana, Chile), Claudia Cancino-Quiroz (Universidad Tecnológica Metropolitana, Chile), Jorge Vergara-Quezada (Universidad Tecnológica Metropolitana, Chile), Ana Moya-Beltrán (Universidad Tecnológica Metropolitana, Chile)
- Performance-Guided Evaluation of Clustering Strategies for Single-Cell RNA Sequencing in Cancer Research within HPC Environments, Welinton Barrera-Mondaca (Universidad Tecnológica Metropolitana, Chile), Joaquín Araya-Bustos (Universidad Tecnológica Metropolitana, Chile), Renato Álvarez-Ramos (Universidad Tecnológica Metropolitana, Chile), Claudia Cancino-Quiroz (Universidad Tecnológica Metropolitana, Chile), Raúl Caulier-Cisterna (Universidad Tecnológica Metropolitana, Chile), Ana Moya-Beltrán (Universidad Tecnológica Metropolitana, Chile)
- Landscape of Machine Learning Methods and Data Representations for Antimicrobial Resistance: Toward a Benchmarking Framework in HPC Environments, Camilo Cerda Sarabia (Universidad Tecnológica Metropolitana, Chile), Fernanda Bravo Cornejo (Universidad Tecnológica Metropolitana, Chile), Belén Díaz Díaz (Universidad Tecnológica Metropolitana, Chile), Fausto Cabezas-Mera (Universidad Tecnológica Metropolitana, Chile), Jorge Vergara-Quezada (Universidad Tecnológica Metropolitana, Chile), Ana Moya-Beltrán (Universidad Tecnológica Metropolitana, Chile)
- GenomeDefender: Validated High-Precision Detection of Data Poisoning Attacks in Single-Cell RNA-seq Data using a Multi-Model Ensemble, Renato Álvarez Ramos (Universidad Tecnológica Metropolitana, Chile), Claudia Cancino Quiroz (Universidad Tecnológica Metropolitana, Chile), Joaquín Araya Bustos (Universidad Tecnológica Metropolitana, Chile), Welinton Barrera Mondaca (Universidad Tecnológica Metropolitana, Chile), Ana Moya-Beltrán (Universidad Tecnológica Metropolitana, Chile), Victor Escobar Jeria (Universidad Tecnológica Metropolitana, Chile)
- High-Performance Computing Evaluation of GATK and Parabricks for Genetic Biomarker Detection in Cancer using scRNA-seq, Claudia Cancino-Quiroz (Universidad Tecnológica Metropolitana, Chile), Renato Alvarez-Ramos (Universidad Tecnológica Metropolitana, Chile), Welinton Barrera-Mondaca (Universidad Tecnológica Metropolitana, Chile), Joaquín Araya-Bustos (Universidad Tecnológica Metropolitana, Chile), Victor Escobar (Universidad Tecnológica Metropolitana, Santiago, Chile), Ana Moya-Beltrán (Universidad Tecnológica Metropolitana, Chile)
- Aerodynamic design and CFD simulation of a compact car using OpenFOAM: A case study from the city of Bucaramanga, Colombia, Adrian Vargas-Lizarazo (Universidad Industrial de Santander, Colombia), Jorge Luis Chacón-Velasco (Universidad Industrial de Santander, Colombia)
- Qualitative assessment of High-Performance Computing (HPC) ecosystem in Panama: needs and potential users, Ivan Bonilla (Universidad Tecnológica de Panamá, Panamá), Esteban Meneses (Centro Nacional de Computación Avanzada, Costa Rica), Reinhardt Pinzón (Universidad Tecnológica de Panamá, Panamá)
- Instructional Code Editing Using Transformer Models, Yadiel Mercado (University of Puerto Rico at Rio Piedras, Puerto Rico), Michael Alvarez (University of Puerto Rico at Rio Piedras, Puerto Rico)
- High Performance Computing (HPC) Applied in a Hydrological Study in Panama: The Case of the Upper Watershed of the Chagres River, Melanie Quiroz (Universidad Tecnológica de Panamá, Panamá), Miguel Salceda (CEMCIT-AIP, Panamá), Yolanda Vázquez (Universidad Tecnológica de Panamá, Panamá), Milena Zambrano (Universidad Tecnológica de Panamá, Panamá), Iris Arjona (Universidad Tecnológica de Panamá, Panamá), Xavier Trujillo (Universidad Tecnológica de Panamá, Panamá), Javier Sánchez-Galán (CEMCIT-AIP, Panamá), José Fábrega (CEMCIT-AIP, Panamá), Reinhardt Pinzón (CEMCIT-AIP, Panamá), Johansell Villalobos (Centro Nacional de Alta Tecnología, Costa Rica), Esteban Meneses (Centro Nacional de Alta Tecnología, Costa Rica), Carlos Rudamas (Universidad de El Salvador, El Salvador)
- Computer Vision and Artificial Intelligence in Sports Performance Analysis, Gabriel Torres (University of Puerto Rico at Rio Piedras, Puerto Rico), Carlos Vazquez (University of Puerto Rico at Rio Piedras), Javier Osorio (Universidad Manuela Beltrán), Edusmildo Orozco (University of Puerto Rico at Rio Piedras, Puerto Rico), Michael Alvarez (University of Puerto Rico at Rio Piedras, Puerto Rico)
Venue: Talk of the Town
9:20 AM - 10:40 AM
Chair: Silvio Rizzi
- NUMA-Aware FIFO Scheduling: Optimizing Data Movement for the Montage Workflow
- Subgroup and SIMD Optimization of RTM Kernels in Intel SYCL for Portable Performance
- Leveraging Local Data Share for Efficient Stencil Computation in the Fletcher Model on AMD MI250X
- Fast Sorting for the RISC-V ‘V’ Vector Extension
Authors. Aurelio Vivas (Universidad de los Andes); Harold Castro (Uniandes)
Abstract. High-performance computing systems are essential for efficient scientific workflow execution. However, integrating workflow scheduling algorithms with non-uniform memory access (NUMA) architectures remains largely unexplored. This work extends an existing FIFO scheduling algorithm by incorporating NUMA awareness to reduce data movement. The approach leverages a runtime system that uses the Portable Hardware Locality (hwloc) library to map memory topology and collect task execution metadata, such as core and memory locality, which the scheduler uses to make NUMA- and data locality-aware decisions. The proposed scheduler achieved 64.72%, 65.79%, and 70.17% local read accesses for the 619-, 310-, and 58-task Montage workflows, respectively, improving average task read times and balancing the distribution of tasks, data, and memory accesses. These results demonstrate the effectiveness of the strategy, particularly for larger workflows.
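The placement rule behind a NUMA-aware FIFO scheduler can be sketched schematically: dispatch tasks in arrival order, but bind each one to the NUMA node that already holds most of its input data. The toy simulation below only illustrates that rule; the paper's runtime uses hwloc to discover the real topology and collects actual locality metadata.

```python
# Schematic simulation of NUMA-aware FIFO placement (not the paper's scheduler).
from collections import deque

free_cores = {0: 4, 1: 4}                       # cores available per NUMA node

def numa_aware_fifo(tasks):
    """tasks: iterable of (name, {numa_node: bytes_of_input_on_that_node})."""
    placements = []
    queue = deque(tasks)
    while queue:
        name, data_map = queue.popleft()        # FIFO order preserved
        preferred = max(data_map, key=data_map.get)
        node = preferred if free_cores[preferred] > 0 else \
               max(free_cores, key=free_cores.get)
        free_cores[node] -= 1
        placements.append((name, node, node == preferred))
    return placements

demo = [("mProject_1", {0: 800, 1: 10}), ("mDiff_1", {0: 20, 1: 500})]
for name, node, local in numa_aware_fifo(demo):
    print(f"{name}: node {node} ({'local' if local else 'remote'} reads)")
```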
Authors. Cristiano Alex Künas (Federal University of Rio Grande do Sul - UFRGS); Gabriel Freytag (Federal University of Rio Grande do Sul - UFRGS); Everton Paulino (Intel Labs); Fabio Zuvanov (Intel Labs); Alexandre Sardinha (Petrobras); Philippe Navaux (Federal University of Rio Grande do Sul - UFRGS); Alexandre Carissimi (Federal University of Rio Grande do Sul - UFRGS)
Abstract. Reverse Time Migration (RTM) is a key method for seismic imaging, but it demands substantial computational power and memory. With the growing diversity of GPU architectures, ensuring performance portability while maintaining energy efficiency has become a major challenge. This work evaluates RTM simulations on the Intel Max 1100 GPU, comparing three implementations: a baseline version, a SIMD-optimized version tailored for Intel’s architecture, and a subgroup-based version designed for portability. Experiments across multiple 3D grid sizes and simulation durations assess performance, energy consumption, and computational efficiency. Results show that SIMD optimizations offer the best performance (up to 4.6×) and energy savings (up to 36%) but limit portability. In contrast, subgroup-based optimizations strike a balance, delivering speedups over the baseline while maintaining broader hardware compatibility. These findings suggest practical strategies for deploying efficient and portable RTM simulations, particularly valuable for heterogeneous HPC environments in cloud platforms and collaborative industrial workflows where code sustainability and adaptability are critical.
Authors. Arthur Lorenzon (Federal University of Rio Grande do Sul - UFRGS); Alexandre Sardinha (Petrobras); Philippe Navaux (Federal University of Rio Grande do Sul - UFRGS)
Abstract. As high-performance computing systems scale, maximizing both performance and energy efficiency has become critical, particularly for memory-bound stencil computations that dominate scientific applications. In this scenario, we investigate optimization techniques for the Fletcher model, a high-order finite-difference seismic wave propagation solver, on AMD GPUs. We consider the impact of two hardware-level strategies: (i) exploiting Local Data Share (LDS) memory to reduce redundant global memory accesses in stencil kernels, and (ii) applying non-temporal memory instructions to avoid cache pollution from low-reuse coefficient arrays. The optimizations target the two main kernels of the model, partialDerivatives and propagation, by reusing shared memory in the y- and z-directions, improving spatial locality, and reducing memory traffic. Through a set of experiments performed on an AMD MI250X, we show that using LDS improves performance and energy by 9.3% and 12% compared to the baseline version that exploits the use of global memory. Also, when considering both LDS and non-temporal stores together, the performance gains increase to 31.9% while the energy savings increase to 27%, compared to the baseline version.
Authors. Daniel Salmun (Universidad de Buenos Aires); Esteban Mocskos (Universidad de Buenos Aires & CONICET)
Abstract. The increasing adoption of the RISC-V instruction set architecture (ISA), particularly its “V” Vector (RVV) extension, introduces new opportunities and challenges for software optimization. This paper tackles high-performance sorting on RISC-V processors by developing a native, in-place, single-threaded vectorized Quicksort algorithm specifically optimized for the RVV extension. We investigate key optimization techniques, including (1) LMUL-aware register grouping to maximize throughput, (2) arithmetic workarounds for RVV’s lack of interleaving instructions, and (3) strategic instruction reordering to mitigate hazards. Comprehensive experimental evaluations on a SpacemiT K1 processor (RV64GCVB, 256-bit VLEN) demonstrate superior performance: for large datasets of 32-bit integers, our implementation significantly outperforms all compared state-of-the-art sorting implementations, achieving a speedup of up to 1.89 times over the fastest alternative. These findings highlight RVV’s potential for computationally intensive tasks and offer insights into overcoming its unique architectural challenges, such as implementation-defined register widths and variations in available vector instructions compared to other SIMD platforms.
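The heart of a vectorized quicksort is a branch-free partition: compare a whole vector of elements against the pivot and compact the matching lanes. The NumPy sketch below illustrates that idea only; the paper implements it with RVV mask and compress instructions, in place and at the register level.

```python
# Branch-free, mask-based partition step, illustrated with NumPy (not RVV code).
import numpy as np

def vectorized_partition(block: np.ndarray, pivot: int):
    mask = block < pivot                 # one comparison per lane, no branches
    return block[mask], block[~mask]     # "compress" lanes below/above the pivot

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=16)
lo, hi = vectorized_partition(data, pivot=50)
print("pivot = 50")
print("below:", lo)
print("above:", hi)
```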
Served in the Talk of the Town
11:20 AM - 12:00 PM
Chair: Arthur Lorenzon
- ACCLAIM: Accelerating Long Context LLM Inference on Heterogeneous Edge Platforms
- A Scientific Data Integrity system based on Blockchain
Authors. Rakshith Jayanth (University of Southern California); Yi Chien Lin (University of Southern California); Souvik Kundu (Intel Labs); Deepak A Mathaikutty (Intel Labs); Viktor Prasanna (University of Southern California)
Abstract. With the growing deployment of Large Language Models (LLMs) on edge platforms, supporting long-context inference has become increasingly important. However, achieving low inference latency remains challenging due to high computational and data transfer demands, combined with limited compute and memory resources. In this work, we present ACCLAIM, a novel system designed for long-context inference on heterogeneous edge platforms. ACCLAIM addresses key challenges by: (1) segmenting the prefill stage to generate and access KV cache in chunks, reducing memory footprint; (2) leveraging the inherent sparsity in self-attention to lower computational complexity; and (3) performing offline profiling to determine optimal chunk size and generate an efficient load-balancing strategy across heterogeneous cores. These optimizations significantly reduce Time To First Token (TTFT). We evaluate ACCLAIM on two state-of-the-art heterogeneous platforms, showing up to 23× speedup in TTFT over standalone chunked prefill and up to 1.5× average speedup over state-of-the-art mapping algorithms.
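The chunked-prefill idea can be illustrated with a single-head toy attention in NumPy: the prompt is processed in fixed-size chunks, each chunk attends over the KV cache accumulated so far, and peak memory grows with the cache rather than with a full prompt-length attention buffer. This sketch is not the ACCLAIM system; sparsity exploitation, causal masking within a chunk, and heterogeneous load balancing are omitted.

```python
# Toy single-head chunked prefill with an incrementally grown KV cache.
import numpy as np

def chunked_prefill(x, wq, wk, wv, chunk=64):
    d = wq.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, len(x), chunk):
        xc = x[start:start + chunk]
        q, k, v = xc @ wq, xc @ wk, xc @ wv
        k_cache.append(k); v_cache.append(v)          # grow the KV cache
        keys, vals = np.concatenate(k_cache), np.concatenate(v_cache)
        scores = (q @ keys.T) / np.sqrt(d)            # causal mask omitted
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        outputs.append(attn @ vals)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
seq, dim = 256, 32
x = rng.standard_normal((seq, dim))
wq, wk, wv = (rng.standard_normal((dim, dim)) for _ in range(3))
print(chunked_prefill(x, wq, wk, wv).shape)   # (256, 32)
```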
Authors. Gian Sebastian Mier Bello (Universidad Industrial de Santander - UIS); Carlos Jaime Barrios Hernandez (Universidad Industrial de Santander - UIS); Alexander Martínez Méndez (Universidad Industrial de Santander - UIS); Robinson Rivas (Universidad Central de Venezuela); Luis Alberto Nuñez Villavicencio (Universidad Industrial de Santander - UIS)
Abstract. Most High Performance Computing (HPC) projects nowadays involve large amounts of data obtained from different sources, depending on the project's objectives. Some of these datasets are so large that copying them is sometimes unrealistic. On the other hand, science requires data used for different purposes to remain unaltered, so that different groups of researchers can reproduce results, discuss theories, and validate each other's work. In this paper, we present a novel approach to help research groups validate data integrity on such distributed repositories using Blockchain. Originally developed for cryptographic currencies, Blockchain has demonstrated a versatile range of uses. Our proposal ensures 1) secure access to data management, 2) easy validation of data integrity, and 3) an easy way to add new records to the dataset under the same robust integrity policy. A prototype was developed and tested using a subset of a public dataset from a real scientific collaboration, the Latin American Giant Observatory (LAGO) Project.
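The integrity principle can be illustrated with a minimal hash chain: each record stores the hash of the previous block, so altering any earlier dataset entry invalidates every later link. The sketch below (file names invented) only demonstrates this verification idea, not the full Blockchain-based prototype described in the paper.

```python
# Minimal hash-chain sketch of dataset integrity verification.
import hashlib, json, time

def make_block(prev_hash, payload):
    block = {"prev": prev_hash, "time": time.time(), "payload": payload}
    block["hash"] = hashlib.sha256(
        json.dumps({k: block[k] for k in ("prev", "time", "payload")},
                   sort_keys=True).encode()).hexdigest()
    return block

def verify(chain):
    for i, b in enumerate(chain):
        expected = hashlib.sha256(
            json.dumps({k: b[k] for k in ("prev", "time", "payload")},
                       sort_keys=True).encode()).hexdigest()
        if b["hash"] != expected or (i > 0 and b["prev"] != chain[i - 1]["hash"]):
            return False
    return True

chain = [make_block("0" * 64, {"file": "lago_run_001.h5", "sha256": "ab12..."})]
chain.append(make_block(chain[-1]["hash"], {"file": "lago_run_002.h5", "sha256": "cd34..."}))
print("chain valid:", verify(chain))
chain[0]["payload"]["file"] = "tampered.h5"
print("after tampering:", verify(chain))
```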
Moderator: Addison Snell, Intersect360 Research
Panelists:
- Bernd Mohr, Jülich Supercomputing Centre
- Verónica Melesse Vergara, Oak Ridge National Laboratory
- Carlos Jaime Barrios Hernandez, SCALAC
Best Poster Award "Comparative Performance Analysis of DNA Sequence Encoding Methods for Machine Learning-Based Bacterial Classification"
Diego Santibáñez Oyarce, Esteban Gómez Terán, Jorge Vergara-Quezada, Ana Moya-Beltrán
Best Paper Award
"Fast Sorting for the RISC-V ‘V’ Vector Extension"
Daniel Salmun, Esteban Mocskos
1:00 PM - 1:10 PM