1st Workshop on High Performance Fabrics

31st IEEE International Conference on High Performance Computing, Data, & Analytics

Dec 18, 2024
Bengaluru, India

Workshop on High Performance Fabrics (FABRICS 2024)

In conjunction with the 31st IEEE International Conference on High-Performance Computing, Data, & Analytics (HiPC 2024)

Description

“Fabrics” is perhaps the most used, or abused, word in computing today. A fabric is, as its name implies, a web of multiple devices in a cluster, joined by physical-level connectivity with transport and protocol layers built on top of it. Computing in general, and High Performance Computing in particular, has moved over the years from proprietary cluster interconnects to the more widely used InfiniBand and Ethernet.

However, with the AI era, the focus of late has shifted to the communication fabric closer to the CPUs and accelerators (including GPUs). Compute Express Link (CXL) is a memory-semantic fabric developed to address challenges related to memory capacity and bandwidth, as well as the need to disaggregate memory. For connecting GPUs with each other, NVLink has become the most popular communication fabric. More recently, the Ultra Accelerator Link (UALink) consortium was formed to connect accelerators and allow sharing of memory between them and CPUs. Ethernet is also being enhanced by the Ultra Ethernet Consortium (UEC) to build a low-latency, lossless variant of Ethernet.

The Fabrics for HPC/AI workshop will be a half-day event that aims to explore novel research ideas in this rich space of fabrics, along with related research in operating systems, virtualization and manageability.

 

Keynote 1 : Fabric Challenges in AI Clusters

Speaker : Manoj Wadekar, Meta

Abstract : This talk will focus on the following areas :

  • AI clusters are creating new fabrics in data centers.
  • AI fabrics are pushing network, memory, chip-to-chip and die-to-die interconnects.
  • Exponential growth in AI clusters and fabric limitations are driving unique data center solutions, creating opportunities for innovation.

Bio : Manoj Wadekar (Meta) is a leading figure in Meta’s technology division, focusing on advanced computing solutions and AI integration. His expertise in system architecture and data management has driven significant innovations in Meta’s infrastructure, enhancing the performance and scalability of their computing systems.

 

Keynote 2 : Harnessing power of advanced next-gen fabrics to break memory wall with TCO optimized solution

Speaker : Vishal Tanna, Micron

Abstract : This talk will cover the following areas :

  • Fabrics for Memory and Their Necessity
    • The memory wall problem and the solutions being explored to address it.
    • Evolution of CXL and how it can help mitigate the memory wall issue.
  • Workload Analysis
    • Analysis of different workloads that are either memory capacity or bandwidth bound.
  • CXL for Pooling and Sharing
    • Investigation into how CXL supports resource pooling and sharing, and the role of FamFS in these features.
  • Emerging Fabrics
    • Introduction to UALink and its benefits.

Bio : Vishal is the Director of Engineering at Micron. He currently leads the R&D teams working on various aspects of CXL interface-based product development. Before this, he led the team that delivered products ranging from Micron’s first-ever PCIe-based NAND flash controller to complex enterprise NVMe SSDs. He is an expert in fabrics like PCIe, CXL and NVMe.

 

Accepted Papers

| Title | Authors and Affiliation |
| --- | --- |
| Study of CXL Memory Sharing with FamFS and its Use cases | Aravind Ramesh (Micron), John Groves (Micron) |
| Porting of OpenSM over Trinetra-A Switchless Torus Network | V Evancer Vino John, Mahesh Chaudhari, Yogeshwar Sonawane and Sanjay Wandhekar (C-DAC) |
| Simulation-Driven Design of Large-Scale Systems Architecture Using Slingshot Fabrics | Shridhar Joshi, Sumant Kalra (Hewlett Packard Enterprise) |
| Performance optimization on CXL products using in-house modeling and simulation toolchain | Kirthi Ravindra Kulkarni, Anandhavel Nagendrakumar, Rohit Sehgal, Eishan Mirakhur, Nikesh Agarwal, Chandana Manjula Linganna (Micron) and Ranjit Gupte (Microchip) |
| Towards Continuous Checkpointing for HPC Systems Using CXL | Ellis Giles, Peter Varman (Rice University) |

 

Workshop Schedule

| Time | Topic | Speaker and Affiliation |
| --- | --- | --- |
| 8.30-8.45 a.m. | Welcome and Workshop Introduction | Mohan Parthasarathy, Hewlett Packard Enterprise |
| 8.45-9.40 a.m. | Keynote 1 : Fabric Challenges in AI Clusters | Manoj Wadekar, Meta |
| 9.40-10.05 a.m. | Study of CXL Memory Sharing with FamFS and its Use cases | Aravind Ramesh, Micron |
| 10.05-10.30 a.m. | Porting of OpenSM over Trinetra-A Switchless Torus Network | Yogeshwar Sonawane, C-DAC |
| 10.30-11.00 a.m. | Break | |
| 11.00-11.55 a.m. | Keynote 2 : Harnessing power of advanced next-gen fabrics to break memory wall with TCO optimized solution | Vishal Tanna, Micron |
| 11.55-12.20 p.m. | Simulation-Driven Design of Large-Scale Systems Architecture Using Slingshot Fabrics | Shridhar Joshi, Hewlett Packard Enterprise |
| 12.20-12.45 p.m. | Performance optimization on CXL products using in-house modeling and simulation toolchain | Rohit Sehgal, Kirthi Ravindra Kulkarni, Micron |
| 12.45-1.10 p.m. | Towards Continuous Checkpointing for HPC Systems Using CXL | Ellis Giles, Rice University |
| 1.10-1.15 p.m. | Vote of Thanks | |

Important Dates

Submission site opens:  September 1, 2024

Paper submission due:  September 30, 2024

Author notification:  October 30, 2024

Camera ready version submission:  November 15, 2024

 

Organizers

Workshop Chairs :

Mohan Parthasarathy, Hewlett Packard Enterprise

Sunita Jain, AMD

Publicity Chair  :

Dr Badrinath Ramamurthy, IIITB

 

Program Committee Chair  :

Ajay Joshi, Micron

 

Technical Program Committee Members :

KV Subramanian, AMD

Sunil VL, Ventana

Ritesh Prasad Raturi, Micron

Sparsh Mittal, IIT Roorkee

Navin Bishnoi, Marvell

Maruf, Hasan,  AMD

Hemangee Kalpesh Kapoor, IIT Guwahati

Brian Hirano

Sridhar Muthrasanallur

YR Ananda, IIITB

Questions may be sent to [email protected]

Call for Papers:  

Papers are solicited in areas including, but not limited to:

Hardware Architectures

  • Memory Expansion and Pooling (via technologies like CXL)
  • Ethernet challenges as a High Performance Fabric
  • Fabrics for next gen AI/ML workloads

Software Architectures

  • CXL emulation
  • Operating system support for Tiered Memory
  • Virtualization for Fabrics

Applications and Use-Cases

  • Benchmarking Applications
  • Workload characterization
  • Analysis and Profiling of Fabric Performance

Control Plane Software for Management

  • Fabric management, e.g., Redfish (DMTF) and OFA (OpenFabrics Alliance)
  • In-band and OOB Management of Fabric devices (CXL/Ethernet/Accelerators)

Paper Submissions: Using the main website

General guidelines: Manuscripts submitted to FABRICS 2024 must represent original, unpublished research and must not be under review for any other workshop, conference or journal. Abstracts should contain no more than 300 words and must be submitted along with the paper by the paper submission deadline. The title and abstract submitted by this deadline should have sufficient detail and not just be a placeholder. Submissions should have the final list of authors, as changes may not be feasible later. Papers not following these guidelines will be rejected without review, and further action may be taken, including (but not limited to) notifications sent to the heads of the authors’ institutions.

Length of submission: Submitted manuscripts can be full papers or short papers.

  • Full papers may not exceed six (6) single-spaced double-column pages using 10-point size font on 8.5×11 inch pages (IEEE conference style) plus one extra page for references. 
  • Short papers may not exceed four (4) pages plus one extra page for references, in the same format.

Single blind policy: All submitted manuscripts will be reviewed by the Program Committee under a single-blind review process, so the submitted paper should NOT list any authors or their affiliations.

Submission link: Submit your paper at:

HiPC 2024 (31st IEEE International Conference on High Performance Computing, Data and Analytics – Fabrics Workshop) (easychair.org) 

HiPC 2024 is the 31st edition of the IEEE International Conference on High Performance Computing, Data, and Analytics. It will be an in-person event in Bengaluru, India, from December 18 to December 21, 2024.

IEEE Conduct and Safety Statement

IEEE believes that science, technology, and engineering are fundamental human activities, for which openness, international collaboration, and the free flow of talent and ideas are essential. Its meetings, conferences, and other events seek to enable engaging, thought provoking conversations that support IEEE’s core mission of advancing technology for humanity. Accordingly, IEEE is committed to providing a safe, productive, and welcoming environment to all participants, including staff and vendors, at IEEE-related events.

IEEE has no tolerance for discrimination, harassment, or bullying in any form at IEEE-related events. All participants have the right to pursue shared interests without harassment or discrimination in an environment that supports diversity and inclusion.

Participants are expected to adhere to these principles and respect the rights of others. IEEE seeks to provide a secure environment at its events. Participants should report any behavior inconsistent with the principles outlined here, to on site staff, security or venue personnel, or to [email protected].

Diversity and Inclusion

HiPC is committed to the promotion of diversity and inclusion in all professional activities. We encourage diversity and welcome everyone regardless of age, gender identity, race, ethnicity, socioeconomic background, country of origin, religion, sexual orientation, physical ability, political views, education, and work experience.
