Shivansh Ahuja

C++ Developer | ML Infrastructure Engineer

shivanshahuja.dev·shivansh.ahuja91@gmail.com·+91 8383916727·India

§01

summary

Systems-oriented developer focused on C++, distributed training infrastructure, and GPU-backed ML platforms. Built end-to-end job orchestration, checkpointing, runtime launch, observability, and managed GKE provisioning flows using Python, Go, Kubernetes, and Volcano. Background also includes Java backend development and graphics and game programming.

§02

experience

Research Commons

· C++ Developer Intern

|—— May 2025 - Present ——|   Remote

[a]

C++ / ML Systems

  • Built and optimized components of a C++ AI inference engine and ML framework from the ground up, working on low-level internals such as memory management, threading, execution flow, and performance-critical code paths.
  • Wrote optimized kernels for x86 and ARM, including SIMD paths with AVX and NEON, and implemented NVIDIA CUDA kernels for selected workloads.
  • Applied systems and compiler concepts, including multithreading, execution pipelines, compiler IRs, JIT techniques, and profiling, backed by fundamentals in operating systems, computer architecture, and digital design.
  • Reverse engineered and broke down frameworks such as PyTorch and tinygrad to understand autograd, tensor operations, and framework internals, then turned those findings into implementation guidance for the team.
[b]

Distributed Training Platform

  • Built the Kubernetes-native distributed fine-tuning platform end to end, spanning a Python client framework, Go operator, TrainingJob CRD contract, and Volcano-based workload orchestration.
  • Implemented lifecycle orchestration for torchrun, DeepSpeed, and Python evaluation runtimes across Kubernetes and Volcano, including submission, reconciliation, cancellation, and pause and resume flows.
  • Built checkpointing and resume infrastructure in Python with atomic commits, async uploads, retention, orphan cleanup, and checkpoint status publishing for distributed training jobs.
[c]

Cloud / Observability

  • Implemented managed GKE provisioning and validation with Workload Identity and Artifact Registry, including cluster bootstrap, preflight checks, and observability stack wiring.
  • Added observability and reliability features using structured logs, Prometheus metrics, and pod-level restart and OOM diagnostics to improve debugging and production stability.

SAP Labs India

· Scholar

|—— Aug 2023 - Jul 2024 ——|   Bengaluru, India

  • Developed backend APIs for payroll systems in SAP SuccessFactors using Java, Spring Boot, J2EE, and SAP S/4HANA.
  • Integrated microservices with internal services and wrote automations and unit tests for production workflows.

Indian Institute of Technology Delhi

· Game Developer Intern

|—— Dec 2021 - Jul 2022 ——|   Delhi, India

  • Built a VR pit simulation in Unity and contributed to additional simulation work during the internship.
§03

projects

§04

stack

Languages
C++, C#, Java, Python, CUDA
Systems / ML Infra
Modern C++, multithreading, OpenMP, SIMD (AVX), memory management, distributed training, checkpointing, Docker, Kubernetes, Volcano, Ray jobs
Backend / Platform
Spring Boot, J2EE, REST APIs, SQL, Kafka, Git, CMake
Graphics / Games
OpenGL, SFML, SDL, Unity 3D, Unreal Engine 5.1, Cocos2D-x
Math / CS
Linear algebra, 3D math, trigonometry, operating systems, computer architecture, digital design
§05

accolades

§06

education

Guru Gobind Singh Indraprastha University

|—— Jul 2020 - Jun 2023 ——|

Bachelor of Computer Applications · New Delhi, India