Abstracts
On Consciousness and Computation
Gianfranco Bilardi
University of Padova, Italy
Consciousness is a fundamental property generally attributed to human beings. Perhaps, more than intelligence, consciousness is what characterizes a human being as an ethical subject. Whether or not machines can be conscious may ultimately determine the position of Homo Sapiens in future history.
In the past, the nature of consciousness has been investigated mostly by philosophers and considered outside the scope of science. This state of affairs has begun to change in the last three decades or so, with a number of research programs explicitly pursuing the formulation of a scientific theory of consciousness. In several of the approaches, key ideas are borrowed from the computer and information sciences.
This talk aims to briefly introduce the subject of consciousness and to survey a sample of the approaches currently pursued to understand it, with a focus on those aspects that may become the seeds of a new area of computer science research in the coming decades, most likely in interaction with other disciplines.
A review of processor advances over the past thirty years - an outsider’s perspective
Jan Eitzinger (Treibig)
University of Erlangen, Germany
This talk will provide an overview and analysis of the performance advances in processor architecture over the last thirty years. The focus will be on x86 chips used in HPC systems. Technical aspects as well as marketing and economic aspects will be discussed. In the analysis, basic in-core and memory hierarchy properties are examined. As a recent development, the Apple Silicon core architecture is compared with current x86 variants. Finally, statistical data on the current performance requirements of a Tier-2 academic HPC center are shown.
Circuits for the Discrete Fourier Transform: History and Perspectives
Carlo Fantozzi
University of Padova, Italy
The rise of domain-specific hardware accelerators has only emphasized the relevance of a long-standing question in computer science: how does one design the best machine for a given algorithm or computational problem? The question has been studied theoretically since the early 1980s through the VLSI model of computation, which describes a VLSI circuit as a two-dimensional layout of input/output ports, computing gates, and wires. A fundamental result of these studies is that, for a computational problem Pi, there is a profound, quantitative link between the area of a VLSI circuit solving Pi, its execution time, and the information exchange of Pi: intuitively, the minimum number of bits that two parts of the circuit must exchange to solve an instance of Pi.
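In standard VLSI-theory form (my own recap of the classic relation, not a statement taken from the talk), this link reads

\[
A \cdot T^{2} \;=\; \Omega\!\left(I^{2}\right),
\]

where A is the layout area, T the computation time, and I the information exchange of Pi, i.e., the number of bits that the two sides of a (roughly balanced) partition of the inputs and outputs must exchange in the worst case. Any lower bound on I for the DFT therefore translates directly into an area-time lower bound for every circuit computing it.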
In this talk, we will present results on the information exchange (hence, on the area and time) of VLSI circuits computing the Discrete Fourier Transform (DFT), moving from historical results to the present day. We will then introduce some questions that remain open for future research. The DFT is a relevant primitive in countless applications; indeed, some commercial chips implement it directly in silicon. Implementations usually rely on the radix-2 FFT algorithm, but this choice is debatable for domain-specific accelerators that must ensure good performance across different scenarios, hence for different DFT lengths. We will discuss connections between the theoretical properties of the DFT as a computational problem and the need for adaptivity in present and future chips.
Joint work with Gianfranco Bilardi.
A Proposal for Large‑Scale Graph Processing on Multi-FPGAs
Roberto Giorgi
University of Siena, Italy
Scaling up the performance of graph processing poses challenges due to the ever-increasing size of graphs and the irregular nature of the associated memory accesses. While CPUs and GPUs are widely used as an easy-to-program approach, further optimization calls for more flexible architectures such as Field-Programmable Gate Arrays (FPGAs). FPGAs are recognized as more efficient and performant for generic data flows; however, careful optimization is needed to realize this potential. We propose an improved multi-FPGA distributed architecture combined with an efficient partitioning scheme and an analytic performance model, validated on FPGAs, over a simple clustering scheme. We evaluated our model with a tailored PageRank algorithm. The FPGA solution can offer a 26x speedup compared to GPUs and other FPGA solutions, and we project a further 12x speedup using multiple FPGAs compared to a single FPGA. These improvements appear to grow even larger at bigger graph scales.
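To illustrate the irregular access pattern at the heart of the problem, here is a textbook pull-based PageRank iteration in C++ (a generic sketch, not the tailored algorithm evaluated in the talk):

```cpp
// Minimal pull-based PageRank iteration over a CSR graph (textbook version).
// The gather over in-neighbors (the read of rank_old[src]) is the irregular,
// data-dependent memory access that makes graph processing hard to scale.
#include <cstdint>
#include <vector>

void pagerank_iteration(const std::vector<uint32_t>& row_ptr,    // CSR offsets, size n+1
                        const std::vector<uint32_t>& in_nbrs,    // concatenated in-neighbor lists
                        const std::vector<uint32_t>& out_degree, // assumed > 0 for every vertex
                        const std::vector<double>& rank_old,
                        std::vector<double>& rank_new,
                        double damping = 0.85) {
    const std::size_t n = rank_old.size();
    for (std::size_t v = 0; v < n; ++v) {
        double sum = 0.0;
        for (uint32_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            uint32_t src = in_nbrs[e];             // irregular, cache-unfriendly read
            sum += rank_old[src] / out_degree[src];
        }
        rank_new[v] = (1.0 - damping) / n + damping * sum;
    }
}
```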
Quantitative Codesign of Heterogeneous Architectures through ML-based ModSim
Adolfy Hoisie
Brookhaven National Laboratory, USA
The talk will focus on recent advances in an ongoing project aiming at developing machine learning (ML) techniques for architecture modeling and simulation (ModSim), as a new alternative to the existing ModSim methods and tools. Specifically, the ability to cope with heterogeneous architectures for complex workflows will be emphasized. A new frontier of ML-based ModSim applicability to codesign in a dynamic (at runtime) regime based on novel techniques for performance representation will be discussed.
Quantum computing in the manufacturing industry
Thomas Husslein
OptWare GmbH, Germany
Many expectations are attached to quantum computing, and for much of the technology it is still undecided which path it will take. In this talk, I would like to show how we are now examining different use cases in manufacturing and logistics for their suitability with respect to different quantum computing approaches, and thus obtaining a better assessment of their disruptive potential.
Building the Infrastructure Memex: VU on Operational Data Analytics in the 21st Century
Alexandru Iosup
Vrije Universiteit Amsterdam, The Netherlands
Our society has turned digital: from science to business, from online shopping to online gaming, from education to government, digital applications depend every moment on Information and Communication Technology (ICT) infrastructure. To pilot this infrastructure and navigate through it, we need detailed information and decision-making support, provided on time, cheaply, and reliably. (We briefly argue that non-data-driven approaches are currently unable to cope with real conditions.)
In this talk, we argue that Operational Data Analytics (ODA) can provide these capabilities, but requires advances not only in monitoring and observability, but also in data sourcing and ontology mapping, data cleaning and filtering, data characterization and modeling, and process management. Furthermore, we argue that a digital twin, capable of both simulating current conditions and making long-term predictions, should be integrated with the ODA system - providing an advanced form of A - to help with resource management and scheduling, pipeline optimization, energy awareness, general system tuning, capacity planning, etc.
We present a reference architecture for ODA, a partial analysis of the state of the art, and experience with data collection and its use in digital twinning of datacenters. This work aligns with and benefits from work in collaboration with the SPEC RG Cloud Group and, among others, the EU Graph-Massivizer project and the OffSense and 6G Future Network Services projects in the Netherlands.
Programming Abstractions for Data Locality: Mapping the landscape of automation in locality optimisation
Paul Kelly
Imperial College, United Kingdom
Exploiting locality is fundamentally harder than finding parallelism. Yet locality is absolutely crucial for achieving performance, ever more so as we approach physical limits and as energy becomes a first-order concern. We have developed many tools for tackling this challenge by hand - sometimes with some abstraction. For me, though, the real way forward is to automate: to deploy algorithms that find a good locality+parallelism combination. This talk does not offer new research results - instead, I aim to map out what we know. I'll illustrate with examples from my own group's work and others'. The main objective is to promote and frame discussion at the workshop.
High-level parallel programming of heterogeneous parallel systems and clusters with SkePU 3
Christoph Kessler
Linköping University, Sweden
SkePU is a C++ based single-source high-level programming framework for portable execution on heterogeneous parallel systems, developed since 2010 as an open-source project at Linköping University. In this talk, we give a quick overview of its current third generation, SkePU 3, most of which was developed as part of the recent EU FET-HPC project EXA2PRO, and report on some ongoing work.
SkePU is an embedded DSL extending modern C++ with high-level constructs for portable modeling of parallelizable computations, based on algorithmic skeletons such as map, reduce, stencil etc., as well as abstractions for data structures and data access patterns that the skeleton calls operate on. For each skeleton, SkePU provides a set of different backends: for sequential and parallel execution on CPU, for execution on GPU, hybrid CPU-GPU computing, and parallel execution on clusters. SkePU can automatically select the backend/resource for each skeleton call based on a performance model, and automatically optimizes the execution flow for data movements and data locality.
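As a rough sketch of this programming style (SkePU-like C++; the exact header, the skeleton constructors skepu::Map/skepu::Reduce, the skepu::Vector container, and the precompilation workflow are simplifications based on the public documentation and may differ in detail):

```cpp
// SkePU-style sketch of a dot product built from Map and Reduce skeletons.
// The runtime, not the programmer, decides whether each call runs
// sequentially, with OpenMP, on a GPU, or across a cluster.
#include <skepu>   // header name per SkePU 3 documentation (assumption)

float mul(float a, float b) { return a * b; }   // user function for Map
float add(float a, float b) { return a + b; }   // user function for Reduce

float dot(skepu::Vector<float> &x, skepu::Vector<float> &y)
{
    auto elemwise = skepu::Map<2>(mul);   // binary element-wise skeleton
    auto sum      = skepu::Reduce(add);   // reduction skeleton

    skepu::Vector<float> tmp(x.size());
    elemwise(tmp, x, y);                  // backend selected by the runtime
    return sum(tmp);
}
```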
Recent extensions of SkePU 3 include language and standard library extensions e.g. for deterministic, portable parallel pseudorandom number generation, new DNN operators, and a backend for FPGA execution.
Joint work with August Ernstsson from Linköping University and many others who contributed to SkePU over the years. See https://skepu.github.io for the source code, documentation, and a list of all contributors.
Linear Probing Revisited: Tombstones Mark the Death of Primary Clustering
Bradley Kuszmaul
RelationalAI, USA
Now that Moore’s law has ended, we must investigate and teach other ways to achieve high performance. One avenue to higher performance is improved data structures and algorithms. As a concrete example, I’ll talk about recent work with Michael Bender and William Kuszmaul on linear probing in hash tables.
First introduced in 1954, linear probing is one of the oldest data structures in computer science, and due to its unrivaled data locality, it continues to be one of the fastest hash tables in practice. It is widely believed and taught, however, that linear probing should never be used at high load factors. This is because primary-clustering effects cause insertions at load factor 1−1/x to take expected time Theta(x^2) (rather than the ideal Theta(x)). The dangers of primary clustering, first discovered by Knuth in 1963, have been taught to generations of computer scientists, and have influenced the design of many widely used hash tables.
We show that primary clustering is not a foregone conclusion. We demonstrate that small design decisions in how deletions are implemented have dramatic effects on the asymptotic performance of insertions, so that, even if a hash table operates continuously at a load factor 1−Theta(1/x), the expected amortized cost per operation is Õ(x). This is because tombstones created by deletions actually cause an anti-clustering effect that combats primary clustering.
We also present a new variant of linear probing (which we call graveyard hashing) that completely eliminates primary clustering on any sequence of operations: if, when an operation is performed, the current load factor is 1−1/x for some x, then the expected cost of the operation is O(x). One corollary is that, in the external-memory model with data blocks of size B, graveyard hashing offers the following remarkable guarantee: at any load factor 1−1/x satisfying x=o(B), graveyard hashing achieves 1+o(1) expected block transfers per operation. Past external-memory hash tables have only been able to offer a 1+o(1) guarantee when the block size B is at least Omega(x^2).
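To make the mechanism concrete, here is a minimal C++ sketch of linear probing with tombstone deletion (an illustration of the tombstone reuse that drives the anti-clustering effect, not the paper's graveyard-hashing scheme, which additionally spreads artificial tombstones through the table during rebuilds):

```cpp
// Linear probing with tombstone deletion (illustrative sketch).
// Tombstones left by erase() can be recycled by insert(), which shortens
// probe sequences for subsequent insertions.
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

class LinearProbingSet {
    enum class Slot : uint8_t { Empty, Full, Tombstone };
    std::vector<Slot> state_;
    std::vector<uint64_t> key_;
    std::size_t mask_;

public:
    explicit LinearProbingSet(std::size_t log2_capacity)
        : state_(1ull << log2_capacity, Slot::Empty),
          key_(1ull << log2_capacity, 0),
          mask_((1ull << log2_capacity) - 1) {}

    // Sketch: assumes the table always retains at least one Empty slot.
    bool insert(uint64_t k) {
        std::size_t i = std::hash<uint64_t>{}(k) & mask_;
        std::optional<std::size_t> reuse;            // first tombstone on the probe path
        for (;; i = (i + 1) & mask_) {
            if (state_[i] == Slot::Full) {
                if (key_[i] == k) return false;      // already present
            } else if (state_[i] == Slot::Tombstone) {
                if (!reuse) reuse = i;               // remember it, keep probing
            } else {                                 // Empty: key is absent
                std::size_t j = reuse ? *reuse : i;  // prefer recycling a tombstone
                state_[j] = Slot::Full;
                key_[j] = k;
                return true;
            }
        }
    }

    bool erase(uint64_t k) {
        std::size_t i = std::hash<uint64_t>{}(k) & mask_;
        for (; state_[i] != Slot::Empty; i = (i + 1) & mask_) {
            if (state_[i] == Slot::Full && key_[i] == k) {
                state_[i] = Slot::Tombstone;         // keep the probe chain intact
                return true;
            }
        }
        return false;
    }
};
```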
Multiplying 2 × 2 Sub-Blocks Using 4 Multiplications
Yoav Moran
Hebrew University, Israel
Efficient implementations of recursive matrix multiplication switch to the cubic-time classical algorithm on small sub-blocks, as the classical algorithm requires fewer operations on small blocks. We obtain an algorithm that can outperform the classical one, even on small blocks, by trading multiplications for additions. To this end, we introduce commutative algorithms that generalize Winograd’s folding technique and combine it with fast matrix multiplication algorithms. Thus, when a single scalar multiplication requires rho times more clock cycles than an addition, our technique reduces the computation cost of multiplying the small sub-blocks to a fraction (rho + 3) / (2 rho + 2 + o(1)) of that of the classical algorithm. Our technique also reduces the energy cost of the algorithm, as the rho values for energy costs are typically even larger than those for arithmetic costs.
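For context, here is a sketch of the classic folding idea that the talk generalizes (my own illustration of Winograd’s commutative inner-product identity, not the new 4-multiplication scheme itself): for even n,

\[
\sum_{j=1}^{n} x_j y_j \;=\; \sum_{j=1}^{n/2} (x_{2j-1} + y_{2j})(x_{2j} + y_{2j-1}) \;-\; \sum_{j=1}^{n/2} x_{2j-1} x_{2j} \;-\; \sum_{j=1}^{n/2} y_{2j-1} y_{2j},
\]

which holds only when the scalars commute. In a matrix product, the x-only and y-only sums are computed once per row and column and amortized over many inner products, so each inner product costs roughly n/2 multiplications and 3n/2 additions instead of n multiplications and n−1 additions; this is exactly where a ratio of the form (rho + 3) / (2 rho + 2 + o(1)) comes from.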
Specifically, we obtain an algorithm for multiplying 2 × 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 × 2 matrices. However, we obtain this algorithm by bypassing the implicit assumptions of the lower bound. We provide a new lower bound matching our algorithm for 2 × 2 block multiplication, thus showing our technique is optimal.
Joint work with Oded Schwartz.
The Relentless Pursuit of Performance
Jose Moreira
IBM, T. J. Watson Research Lab., USA
The fastest machine in the June 1993 TOP500 list had a maximum performance score (Rmax) of 60 Gflops. Fast forward to June 2023, and the fastest machine has a maximum performance score of 1.2 Exaflops. Despite the 20-million-fold improvement in maximum performance, we can claim that these two machines are more similar than dissimilar. In this talk, we will summarize the main factors that have contributed to this performance improvement and discuss the main commonalities between the systems that have occupied the top spot in the TOP500 lists over the past three decades. We will discuss other metrics of top performance and venture into predictions of what the future may hold for these top-performing computer systems, including physical limits to their performance.
A view on 40 years of HPC Performance: Critical Ingredients to Complex Scientific Solutions
Wolfgang Nagel
Technische Universität Dresden, Germany
Statistical learning techniques to efficiently identify central nodes from large graphs
Leonardo Pellegrina
University of Padova, Italy
Identifying important components of a graph is crucial for many relevant tasks, such as finding communities in social and biological networks, identifying vulnerable nodes that may be attacked to disrupt the network’s functionality, and maximizing influence. To this aim, several centrality measures have been proposed to quantify the importance of nodes and sets of nodes. Exact algorithms to compute such metrics scale poorly when applied to large graphs; on the other hand, while a few recent works use random sampling to obtain approximate solutions with guarantees, their tradeoff between accuracy and computational requirements is still very loose, making the analysis of large graphs expensive in practice.
This talk will outline our recent advances on this problem: the SILVAN and CentRA algorithms. Our methods develop sharper accuracy guarantees based on Monte Carlo Rademacher averages and the VC dimension, key tools from statistical learning theory, to compute rigorous approximations of the centrality of individual nodes and groups of nodes of a graph.
In practice, these novel methods allow the identification of central nodes from large graphs using a fraction of the resources of state-of-the-art methods (e.g., with up to two orders of magnitude of speedup).
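For a flavor of the underlying guarantees (the generic sample-complexity bound from statistical learning theory, not the sharper data-dependent bounds that SILVAN and CentRA actually compute): if the family of estimator functions has VC-dimension at most d, then a sample of size

\[
m \;=\; O\!\left(\frac{1}{\varepsilon^{2}}\left(d + \ln\frac{1}{\delta}\right)\right)
\]

suffices, with probability at least 1 − δ, to estimate all centrality values within additive error ε. Empirical Rademacher averages replace the worst-case d with a quantity measured on the sample itself, which is what makes the newer bounds much tighter in practice.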
Generating Efficient Kernels for Quantized Inference on Large Language Models
Tommaso Pegolotti
ETH, Switzerland
Data movement has become a major bottleneck in many software computations. Thus, a new avenue to improving performance is data compression combined with low-precision arithmetic. A prominent example is the domain of machine learning, where models have been quantized to low-bit integer values without significant loss of precision. Due to the success of these methods, HPC-grade GPUs and newer server CPUs are now equipped with native support for 4-bit and 8-bit integer mathematical operations. For other precisions, complex pack and unpack operations need to be implemented by hand.
In this work, we show how to leverage quantized arithmetic for high inference throughput on open-source large language models (LLMs) using off-the-shelf CPUs. To do so, we present a program generator for quantized matrix-multiplication kernels in 4-, 3-, and 2-bit precision, invoked from PyTorch. This approach provides significant performance improvements over straightforward implementations.
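As a minimal illustration of the pack/unpack issue (a scalar C++ sketch under an assumed symmetric, per-row-scale quantization scheme; the generated kernels in the talk are vectorized and fuse unpacking with the multiply-accumulate):

```cpp
// Scalar sketch of 4-bit weight packing and a dequantize-and-accumulate
// dot product. Real kernels operate on SIMD registers; this only shows
// the data layout and arithmetic.
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack signed 4-bit values (range [-8, 7]) two per byte.
std::vector<uint8_t> pack4(const std::vector<int8_t>& w) {
    std::vector<uint8_t> out((w.size() + 1) / 2, 0);
    for (std::size_t i = 0; i < w.size(); ++i) {
        uint8_t nibble = static_cast<uint8_t>(w[i]) & 0x0F;
        out[i / 2] |= (i % 2 == 0) ? nibble : static_cast<uint8_t>(nibble << 4);
    }
    return out;
}

// Dot product of a float activation vector with packed 4-bit weights,
// using a single per-row scale (symmetric quantization assumed).
float dot_q4(const std::vector<float>& x, const std::vector<uint8_t>& wq,
             float scale) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        uint8_t byte = wq[i / 2];
        int8_t nibble = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        if (nibble & 0x08) nibble -= 16;   // sign-extend the 4-bit value
        acc += x[i] * (scale * nibble);
    }
    return acc;
}
```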
What have we learned from 50 years of parallel programming research?
Keshav Pingali
University of Texas at Austin and Katana Graphs, USA
Parallel programming started around 1970, so as a discipline it is now more than 50 years old. What have we learned in the past 50 years about parallel programming? What problems have we solved, and what problems remain to be solved? This talk presents a personal point of view about these and other questions regarding the state of parallel programming.
IOUB: A tool for automatically computing and maximizing the operational intensity of affine programs. Can it be used to optimize neural networks?
Fabrice Rastello
INRIA, France
Evaluating the complexity of an algorithm is an important step when developing applications, as it impacts both time and energy performance. Computational complexity, the number of dynamic operations regardless of execution order, is easy to characterize for affine programs. Data movement (I/O) complexity is harder to evaluate: it is the minimum number of I/O operations, over all valid schedules, between a slow storage location (e.g., main memory) and a fast one (e.g., a local scratchpad). IOOpt is a fully automated tool that bounds the data movement of an affine (tileable) program. Given a tileable program described in a DSL, it computes: 1. a lower bound on the I/O complexity as a symbolic expression of the cache size and program parameters (the IOLB part); 2. an upper bound that allows one to assess the tightness of the lower bound (the IOUB part). This talk will focus on IOUB and its upper-bound algorithm, which can provide the programmer with a useful tiling recommendation, even for multi-level caches.
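For a flavor of the bounds involved (a standard matrix-multiplication example stated as my own illustration, not IOOpt output): for C = A·B with N×N matrices and a fast memory of S words, the Hong-Kung-style lower bound is

\[
Q \;=\; \Omega\!\left(\frac{N^{3}}{\sqrt{S}}\right)
\]

words moved, while a tiling with T×T tiles (roughly 3T^2 ≤ S) achieves Q ≈ 2N^3/T + O(N^2), i.e., an operational intensity of about T ≈ sqrt(S/3) flops per word. Deriving such symbolic lower and upper bounds automatically, for general affine programs and multi-level memory hierarchies, is what IOLB and IOUB do.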
Dense neural networks are composed of simple affine operators, and fine-tuning the hierarchical tile sizes of each individual operator is crucial for obtaining high-performance code. Operational intensity seems to be the perfect proxy for this optimization problem. However, state-of-the-art auto-tuning techniques such as those used in Halide, Ansor, or Tiramisu prefer over-fitted models with a very large number of poorly engineered metrics. Using operational intensity as an exemplar, the remaining part of the talk aims to initiate a discussion of the issues raised by using aggregated metrics for compiler optimization.
Towards automated compiler support for distributed-memory parallelism with tensor computations
Saday Sadayappan
University of Utah, USA
While numerous research efforts have addressed automatic parallelization for shared-memory parallel computing (including multicore CPUs and GPUs), few have addressed the automatic partitioning and scheduling of data and computations for distributed-memory parallel systems. This talk describes a new approach to automated compiler optimization of sequences of tensor computations, such as those arising in machine learning and high-order methods in quantum chemistry, for distributed-memory parallel systems. The space of possible mappings of a tensor computation and its data onto a multidimensional processor grid is modeled using a collection of 0/1 variables, and a constrained optimization problem minimizing data-movement volume is solved using the Z3 SMT solver.
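An illustrative, much-simplified flavor of such an encoding (my own sketch, not the authors' exact formulation): introduce binary variables x_{d,p} stating that loop or tensor dimension d is distributed along processor-grid dimension p, subject to

\[
x_{d,p} \in \{0,1\}, \qquad \sum_{p} x_{d,p} \;\le\; 1 \quad \text{for every dimension } d,
\]

and minimize \(\sum_{t} \mathrm{comm}(t, x)\), where comm(t, x) expresses the volume of tensor t that must be replicated or redistributed under mapping x. An SMT solver such as Z3 can then search this 0/1 space, nonlinear volume terms included, for a minimum-communication mapping.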
SHRAY - an Owner-Compute Distributed Shared Memory System
Sven-Bodo Scholz
Radboud University, Nijmegen, The Netherlands
In this talk, we present a new variant of distributed shared memory which, from the programmer’s perspective, behaves like shared memory for arbitrary reads but restricts each node’s writes to a range of addresses that it “owns”. Based on our open-source implementation, named SHRAY (SHared arRAY), we show how this design hits a sweet spot in the design space with respect to programmability of parallel applications, implementation complexity, and achieved parallel performance.
Between ML and HPC, where does (whitebox) performance modeling fit?
Y.C. Tay
National University of Singapore
Machine learning (ML) models can now exert severe demands on high performance computing (HPC) systems. To scale up performance, it is natural in this ML era to adopt a blackbox model of such an increasingly complicated combination of hardware, software, and workload. What role(s) can whitebox models play in the performance analysis of these systems? The answer is relevant when considering what training in performance analysis students and engineers should receive to improve their intuition for system performance. Such training has to factor in the constraints imposed by curriculum structure, and the students’ mathematical preparation and shifting interests.
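As a reminder of what a whitebox model looks like (a textbook single-queue example, my own illustration): for a service center modeled as an M/M/1 queue with mean service demand S and arrival rate lambda,

\[
U = \lambda S, \qquad R = \frac{S}{1-U},
\]

so the model predicts from first principles how the mean response time R explodes as the utilization U approaches 1, a curve a blackbox ML model would have to learn from measurements.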
Performance by Experiment: Enabling Algorithmic and Procedural Experimentation at HPC Scale
Peter Thoman
University of Innsbruck, Austria
Extracting competitive performance from HPC hardware with traditional methods – e.g. explicit message passing and low-level GPU computing – necessitates high levels of expertise throughout the hardware and software stack. Optimizations frequently reduce performance portability, and the resulting implementations are rigid.
Modern avenues toward performance include methods that operate by automated exploration of a potential space of optimizations at an algorithmic and procedural level. However, as long as achieving competitive performance requires a highly specific, tuned low-level codebase, it is difficult to automatically explore fundamentally different implementation options without very significant and domain-specific time investment.
One option for enabling these new optimization ideas, for at least parts of the overall high-performance computing landscape, is higher-level interfaces, which allow a larger set of potential implementation strategies to be explored without changes to the user code. However, such interfaces are only truly valuable in HPC when they offer competitive performance at scale.
In this talk, we will present two core components of a runtime system suitable for the purpose of exploring implementation options of programs specified at a high level of abstraction with competitive performance: (i) a completely distributed, asynchronous execution model, and (ii) automatic discovery and optimization of collective communication patterns in the high-level representation. We also provide an outlook on how such a system can enable rapid exploration of data distribution strategies.
On the necessity of teaching chip design in Europe
Carsten Trinitis
Technical University of Munich, Germany
Both the European Union and national governments are aware of the fact that Europe is falling behind in terms of chip design, especially microprocessor design. Several initiatives have started in that direction, with the European Processor Initiative (EPI) probably being the most prominent one. However, in order to achieve the Commission’s ambitious goals, it is of crucial importance to teach students how to design a chip.
For this reason, the newly created “Information Engineering” program at TUM’s campus in Heilbronn will introduce the topic, starting with a course on hardware description languages. This is part of BB-KI, a joint project between TUM and the University of Potsdam funded by the German Ministry of Education and Research.
The talk will introduce BB-KI and give an overview of the activities starting at TUM Campus Heilbronn.
Reflections on the Past & Various Musings on the Future
Henry Tufo
University of Colorado Boulder, USA
We look back on predictions from our previous ScalPerf talks and try to provide some insight on where HPC is headed in the next decade or so. We leverage our long history developing scientific applications and more recent work on technology tracking, experimental systems development, and designing, maintaining, and operating some of the largest supercomputers in the world, in particular through our partnership with TACC at U. Texas on the Stampede family of systems.
A Dichotomic Race: a Perspective of HPC for Quantum Computing
Marzio Vallero
University of Trento, Italy
The frontier of quantum computing (QC) simulation on classical hardware is quickly reaching the hard scalability limits of computational feasibility. Nonetheless, there is still a need to simulate large quantum systems classically, as Noisy Intermediate-Scale Quantum (NISQ) devices are not yet fault-tolerant or performant enough in terms of operations per second. Each simulation technique has its own specific limitations. The exponential memory requirement of state-vector simulation quickly saturates the capacity of the top HPC machines currently available, even for the qubit register sizes of currently available quantum computers. Tensor-network contraction approaches are limited by the inherent complexity of finding an optimal contraction path. This talk aims to describe the limits of current state-of-the-art simulation techniques, delving into the topic of quantum circuit benchmarking and how it must be adapted in the context of high-performance classical simulation. The performance measures of a set of simulations are then correlated with selected metrics, identifying the main reasons behind the observed trends. Finally, some applications of quantum simulation on classical hardware are discussed, with a special focus on simulation approaches aimed at improving the logical-level reliability of quantum circuits through the logical modeling of quantum hardware faults.
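To make the memory argument concrete (back-of-the-envelope arithmetic, my own illustration): an n-qubit state vector holds 2^n complex amplitudes, so with double-precision complex values of 16 bytes each,

\[
\mathrm{memory}(n) = 2^{n} \cdot 16\ \mathrm{B}, \qquad \mathrm{memory}(50) = 2^{54}\ \mathrm{B} = 16\ \mathrm{PiB} \approx 1.8\times 10^{16}\ \mathrm{B},
\]

already comparable to, or beyond, the aggregate memory of the largest HPC systems, while several current quantum processors have well over 50 qubits.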
Performance by co-design
Ana Varbanescu
University of Twente, Enschede, Netherlands
Co-design has gained popularity as a method for IC design, in an attempt to specialize circuits for applications and gain performance or efficiency (e.g., lower area, higher clock speed, etc.). In this talk, I argue we can apply the co-design principle at coarser granularity, and design compute systems for large-scale applications, ideally with performance guarantees. I will provide several successful co-design examples from previous research (some of it my own), and present a few necessary (but probably not sufficient) conditions - in terms of methods and tools - for co-design to be(come) a feasible and usable approach for efficient system and application design.
A unified affinity-aware Parallel Programming Model
Josef Weidendorfer
Technical University of Munich, Germany
For large-scale HPC applications, the established parallel programming models are OpenMP and MPI, typically combined: OpenMP covers in-node parallelization, and MPI covers cross-node communication and cluster-wide parallelization. With upcoming heterogeneity, multiple memory spaces with different characteristics will further increase complexity. In this talk, I argue for a programming model whose runtime controls the partitioning and allocation of global data containers declared by an application. Required communication is specified by declaring overlapping partitions and asking for value consistency at given points in time. During computational phases, only memory local to compute units is accessed. I explain how this abstraction can cover parallelization at different scales of a compute system with the same interface, while still allowing tuning for cases like hardware-exposed shared memory vs. distributed memory. More concretely, I present 2-copy, 1-copy, and 0-copy schemes, which result in performance behavior similar to MPI, PGAS, and OpenMP implementations, respectively.