§02 work / research-commons
|—— May 2025 - Present ——| Remote
Research Commons
C++ Developer Intern
C++ and ML infrastructure work across three loosely-coupled areas. Framework-level systems code on one end, a Kubernetes-native distributed training platform in the middle, and the managed cloud plumbing holding it all together.
c++ / ml systems
Day-to-day work on a C++ AI inference engine and ML framework: low-level internals, kernel paths, and close study of how PyTorch and tinygrad actually work under the hood.
- Built and optimized core components of a C++ AI inference engine and ML framework, including low-level internals such as memory management, threading, execution flow, and performance-critical code paths.
- Wrote optimized kernel code for x86 and ARM, including SIMD paths using AVX and NEON intrinsics, and implemented NVIDIA CUDA kernels for selected workloads.
- Applied systems and compiler concepts day to day: multithreading, execution pipelines, compiler IRs, JIT techniques, and profiling, grounded in operating systems, computer architecture, and digital design fundamentals.
- Reverse-engineered large frameworks such as PyTorch and tinygrad to understand autograd, tensor operations, and framework internals, then translated those findings into team-facing implementation guidance.
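The SIMD kernel work above can be illustrated with a saxpy-style sketch (hypothetical code, not the engine's actual kernels): an AVX path guarded by `__AVX2__` with a scalar fallback for the tail and for non-AVX targets. A NEON variant would follow the same shape with `float32x4_t` lanes.

```cpp
#include <cstddef>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// Illustrative kernel: y[i] += a * x[i].
// The AVX path processes 8 floats per iteration; the scalar loop
// handles the remainder (and everything, when AVX2 is unavailable).
void saxpy(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = 0;
#ifdef __AVX2__
    __m256 va = _mm256_set1_ps(a);  // broadcast scalar into all 8 lanes
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(vy, _mm256_mul_ps(va, vx));
        _mm256_storeu_ps(y + i, vy);
    }
#endif
    for (; i < n; ++i) y[i] += a * x[i];  // scalar tail / fallback
}
```

The preprocessor guard keeps the same translation unit portable across x86 and ARM builds; dispatch between ISA-specific paths is a build-time decision here, though runtime CPUID dispatch is the other common choice.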
distributed training platform
Built a Kubernetes-native distributed fine-tuning platform end to end: a Python client, a Go operator, a TrainingJob CRD contract, and Volcano-based workload orchestration. The most interesting parts were lifecycle flows (cancel, pause, resume) and the checkpoint/resume pipeline.
- Built a Kubernetes-native distributed fine-tuning platform end to end, spanning a Python client framework, Go operator, TrainingJob CRD contract, and Volcano-based workload orchestration.
- Implemented lifecycle orchestration for torchrun, DeepSpeed, and Python evaluation runtimes across Kubernetes and Volcano, including submission, reconciliation, cancellation, and pause/resume flows.
- Built checkpointing and resume infrastructure in Python with atomic commits, async uploads, retention, orphan cleanup, and checkpoint status publishing for distributed training jobs.
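The atomic-commit part of the checkpoint pipeline follows a standard write-then-rename pattern. A minimal C++ sketch of that pattern (the platform's version was Python, and every name here is hypothetical): write the checkpoint to a temporary file, flush it, then rename it into place, so readers never observe a half-written checkpoint.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Illustrative atomic checkpoint commit (hypothetical helper).
// rename() within one filesystem is atomic on POSIX, so a crash
// mid-write leaves at most a stale *.tmp file, never a torn checkpoint.
bool commit_checkpoint(const fs::path& dir, const std::string& name,
                       const std::string& payload) {
    fs::create_directories(dir);
    fs::path tmp = dir / (name + ".tmp");
    fs::path final_path = dir / name;
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << payload;
        out.flush();
        if (!out) return false;
    }  // stream closed before the rename
    std::error_code ec;
    fs::rename(tmp, final_path, ec);  // atomic replace (same filesystem)
    return !ec;
}
```

The stale `.tmp` files this can leave behind are exactly what the orphan-cleanup pass in the bullet above would sweep up.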
cloud / observability
Stood up managed GKE provisioning and the observability stack that sits underneath training workloads — preflight checks, Workload Identity wiring, Prometheus metrics, and pod-level diagnostics.
- Implemented managed GKE provisioning and validation with Workload Identity and Artifact Registry integration, including cluster bootstrap, preflight checks, and observability-stack wiring.
- Added observability and reliability features using structured logs, Prometheus metrics, and pod-level restart and OOM diagnostics to improve debugging and production stability.
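Pod-level restart and OOM diagnostics typically key off container exit codes: codes above 128 mean the process died from a signal (code minus 128), and 137 (SIGKILL) on Kubernetes usually indicates the kernel OOM-killed the container for exceeding its memory limit. A hypothetical classifier sketch:

```cpp
#include <string>

// Illustrative exit-code classifier (hypothetical helper, not the
// platform's actual diagnostics code).
std::string diagnose_exit(int exit_code) {
    if (exit_code == 0)   return "completed";
    if (exit_code == 137) return "oom-killed (SIGKILL, likely memory limit)";
    if (exit_code == 143) return "terminated (SIGTERM, e.g. pod eviction)";
    if (exit_code > 128)  return "killed by signal " + std::to_string(exit_code - 128);
    return "application error (exit " + std::to_string(exit_code) + ")";
}
```

In practice a diagnostic like this would be attached to structured log lines and surfaced as a labeled Prometheus counter so restart causes can be aggregated per job.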