DASH Workgroup Meeting Minutes 3/6/2024


Kristina Moore

Mar 11, 2024, 7:27:50 PM
to sonic...@googlegroups.com, SONiC DASH/SmartSwitch devs, VNET Leads, Sushant Sharma, Riff Jiang, Andy Fingerhut, Alberto Gonzalez Prieto, Marian Pritsak, Kamil Cudnik, Chris Sommers, Ze Gan, Agrawal, Ashutosh, Cristian Dumitrescu, Srikanth Kandula, Anish Narsian, Apurva Shah, Arun Jeedigunta, Deven Jagasia, Dvir Shamay, Jae Park, James Grantham, Kalyan Kumar Gokavarapu, Ram Kakani, Ravindran Suresh, Renuka Manavalan, Rita Hui, Rohit Jain, Steve Espinosa, Suresh Kumar Nedunchezhian, Tao Deng, Tommaso Pimpo, Xin Liu (CLOUD), Yusef Skinner, Yanzhao Zhang, Mario Baldi, Moopath velayudhan, Mukesh, Thyamagundalu, Sanjay, Veerappan, Senthilnathan, Narayanan, Swaminathan, Srinivasan, Vijay, Dean Lee, Alberto Villarreal, Manodipto Ghose, Mircea Dan Gheorghe, Nitesh Jha, Vinod Kumar, Carol Ann Slater, Jai Kumar, Mohammad Hanif, Sandeep Balani, Amith, Ixim, Nitesh, Ravi, Venkat External, Yoyo, Richard Wu, Liat Grozovik, Oleksandr Ivantsiv, Eilon Greenstein, Matty Kadosh, Nikhil Sandugula, Paul Cummins, Yohad Tor, Dermody, Anthony, Deb Chatterjee, John Andy Fingerhut, McCollum, Macy, Marangwanda, Shingi, Greer, Shan A, Bud Grise, John C Carney, Vincent L, Faisal Khan, Renato Recio, Saad Mazhar, Zafir, Joseph White, Mark Sanders, Phaniraj Vattem, Shawn Dube, Aravind Srikumar (arsrikum), Deepti Chandra (deeptich), Jack Sexton (jacsexto), murali Venkateshaiah (muraliv), Praveen Bhagwatula (pbhagwat), Rob Murphy (robermur), Madhu, Harrish SJ, Madhu, Eddie Ruan, Yuezhou, Zhuengbo, Zhuengbo2, Kishore Atreya, Kannan Selvaraj, Marc Meunier, Meyappan K, Reshma Sudarshan

Hello DASH Community, thank you for your time last Wednesday.

We had @Srikanth Kandula from Microsoft Research present In Network Transformations for AI/ML and other Workloads. This was an interesting and informative talk on the development of an open API for in-network transformations, along with early results. The slides have been uploaded to the DASH GitHub if you would like to view them.


@Cristian Dumitrescu (Intel) provided additional clarification on 'range-match' and the limitations of range support, which stem from P4 rather than bmv2.

We were also able to close 5 PRs/Issues this week (see project list).

For further reading, see full meeting notes and follow-ups below.

Thank you for your time/contributions ~ Kristina

As always, for any of our DASH meetings, Community (Wednesdays) or Behavioral Model (Thursdays), please let me know if there are PRs, Q&A, or items you would like to discuss or present.
Subscribe to the DASH YouTube channel to access WG content (and click the bell to receive notifications).

See you on 3/13/2024!

Meeting Title:  SONiC-DASH-Workgroup Community Meeting #104

Attendees (20):

DASH Group to join: https://groups.google.com/g/sonic-dash     

DASH-Test-Workgroup Group to join: https://groups.google.com/g/sonic-dash-test-workgroup

 

Agrawal, Ashutosh - Intel

Guohan Lu - MSFT

murali Venkateshaiah - Cisco

Rob Murphy - Cisco

Alberto Villarreal - Keysight

Kristina Moore - MSFT

mxiao - Arista

Srikanth Kandula - MSFT

Chris Sommers - Keysight

Kumaresh Perumal - MSFT

Oleksandr Ivantsiv - NVidia

Sushant Sharma - MSFT

Don Ewald - Cisco

Mididaddi, Naren - Intel

Praveen Bhagwatula - Cisco

Veerappan, Senthilnathan - AMD

Dumitrescu, Cristian - Intel

Mircea Dan Gheorghe - Keysight

Riff Jiang - MSFT

Yakiv Huryk - NVidia

 

Discussion - PRs/Issues/Documentation for review, comments, suggestions

 

DASH Community Upcoming Project Action Items

 


 

 

 TL;DR Notes 😊

 

  • Cristian (Intel) clarified the limitations of P4 DPDK range support for the Community; the issue is due to P4 rather than bmv2.
  • In Network Transformations for AI/ML and other Workloads:  In-network compute DASH API - Srikanth Kandula, Partner Researcher at Microsoft Research
    https://www.linkedin.com/in/srikanthkandula/
    • Early joint work with Rishabh, Deepak, Gerald and Srikanth
      • Implementations clean & clear
      • Providers able to deploy different implementations
      • App libraries to have a consistent coding interface

 

  • Srikanth introduced the motivation and objectives of developing an open API for in-network transformations for AI/ML workloads, such as multicast, aggregation and compression, and discussed the design choices and challenges with the community.
  • Feedback and questions: the Community raised several questions and comments for Srikanth, covering the scalability and memory requirements of the transformations, coexistence with different network protocols and stacks, failover and load-balancing mechanisms, and data plane signaling and identification.
  • Q: Robert Murphy (Cisco) - What scale is needed for distribution trees? This relates to memory consumption…
  • A: Srikanth (MSR) - Roughly 1 tree per ongoing collective. It depends on the job and how much time it spends in the GPU or in the collective, and on the number of training jobs or inferences running in the cluster.

 


Follow-up tasks:

  • Network compute API:  Obtain information on how deep in the packet the NPU can look for the signaling field (@robe...@cisco.com)
  • Network compute API:  Send the meeting series to Srikanth and forward any feedback from the community (Kristina)
  • Upload presentation slides to GitHub - done

 


 

Each receiver needs to receive the other 3 packets; multicast protects the out-link

The math looks like a matrix multiplication whose outcome is a smaller vector

The key takeaway is that good results are being seen
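
The multicast note above can be illustrated with a small count. This is only a sketch under assumptions: four participants (inferred from "other 3 packets"), each contributing one packet, and a network that replicates a multicast packet so the sender transmits it once. The function names are hypothetical and are not part of any DASH API or of the presented design.

```python
def packets_needed_per_receiver(participants: int) -> int:
    """Each receiver needs one packet from every other participant."""
    return max(participants - 1, 0)


def out_link_sends(participants: int, multicast: bool) -> int:
    """Copies of its own packet that one sender places on its out-link.

    Assumption: with unicast the sender transmits one copy per peer;
    with multicast the network replicates, so the sender transmits once.
    """
    if participants < 2:
        return 0
    return 1 if multicast else participants - 1


if __name__ == "__main__":
    n = 4  # assumed group size, inferred from "other 3 packets"
    print("packets each receiver needs:", packets_needed_per_receiver(n))
    print("out-link sends, unicast:", out_link_sends(n, multicast=False))
    print("out-link sends, multicast:", out_link_sends(n, multicast=True))
```

Under these assumptions the receive side is unchanged, while the load on the sender's out-link drops from one copy per peer to a single transmission.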

 


 

Think of these as axioms

The device doing this needs to maintain the state

 


 

 

 

Behavioral Model WG - March 7 2024 - https://youtu.be/DveEx4nxgug

 

HA APIs, presented by Riff from a SAI perspective, to support DASH HA and ENI-level HA

https://Github.com/r12f/DASH

 

Riff presented the high-level concepts, data structures, packet flow and workflows for the DASH HA API design and answered questions from Intel and Keysight. The API is intended to support different HA modes and scenarios and is being vetted by some backchannel stakeholders.

 

DASH HA P4 implementation: Riff also shared his progress on the P4 implementation of the DASH HA behavior model and the plan to create a virtual DPU image for testing. He mentioned some challenges with the data plane app and the flow API. (35:39)

 

P4-DPDK for DASH: Chris Sommers updated the group on Andy's work on P4-DPDK for DASH and the issues he encountered with P4 runtime. He said the P4-DPDK team is aware of the issues filed with them related to the DASH use case. (43:41)

 

 

Other Notes:

 

Introduction/Welcome:
Srikanth Kandula, Partner Researcher at Microsoft Research
https://www.linkedin.com/in/srikanthkandula/

 

                  

Sticky for Links/Data:

 

 

DASH Groups to join to receive Invites, Meeting Notes, and Comms

DASH: https://groups.google.com/g/sonic-dash    

 

DASH-Test-Workgroup Group: https://groups.google.com/g/sonic-dash-test-workgroup  

If anyone knows other people who would like info about our community, please have them join these groups to receive Comms, etc.

Links to Recording 

Teams/Sharepoint:
SONiC-DASH Workgroup Community Meeting-20240306_100348-Meeting Recording.mp4

 

DASH Community YouTube:
https://youtu.be/ypY67dlYVgg

HA moved to SmartSwitch LF group

Behavioral Model YouTube:
https://youtu.be/DveEx4nxgug

3/6/2024  Community Call; please request access via the link if you are not able to view/listen
3/7/2024 Behavioral Model call

 

Azure DASH GitHub Repo:                     

https://github.com/sonic-net/DASH

 


Test/Docs folder:

https://github.com/Azure/DASH/blob/main/test/docs/dash-test-workflow-saithrift.md

Ideal test workflow is here, converted to .md

SAI Thrift     

SAI Thrift PR

Client server needed for testing

P4

https://opennetworking.org/p4/ and https://p4.org/working-groups/

Open source, domain-specific programming language for network devices, specifying packet processing for data plane devices (switches, routers, NICs, filters, etc.)

PINS

https://opennetworking.org/pins/

 

PNA consortium spec

https://p4.org/p4-spec/docs/PNA-v0.5.0.html

An architecture describing the structure and common capabilities of network interface controller (NIC) devices which process packets transiting one or more interfaces and a host system.

Describes the structure and capabilities of the pipeline, and a user program, which specifies the functionality of the programmable blocks within that pipeline. For more information, see the P4 Language Consortium specifications.

IPDK

Infrastructure Programmer Development Kit (ipdk.io) and

https://github.com/ipdk-io/ipdk-io.github.io

IPDK is an open source, vendor agnostic framework of drivers and APIs for infrastructure offload and management which runs on a CPU, IPU, DPU or switch. IPDK runs in Linux and uses a set of well-established tools such as DPDK and P4 to enable network virtualization.

bmv2

https://github.com/p4lang/behavioral-model

bmv2 (behavioral model version 2) is the second version of the reference P4 software switch. The software switch is written in C++11. It takes as input a JSON file generated from your P4 program by a P4 compiler and interprets it to implement the packet-processing behavior specified by that P4 program.

DPDK

https://www.dpdk.org/

DPDK is the Data Plane Development Kit which consists of libraries to accelerate packet processing workloads running on a wide variety of CPU architectures.

Linux Foundation SmartSwitch

https://lists.sonicfoundation.dev/g/sonic-smartswitch/calendar

 

      

 

Thank you again for your participation…

Kristina Moore MBA, M.S., CISSP - Azure Core Principal PM / DASH
Office: 425-722-7720     Mobile: 425-876-2040     Email: kri...@microsoft.com


 

 
