Kevin He

About Me

I'm a PhD student in Computer Science at Harvard University, advised by Prof. David Brooks. I am broadly interested in computer architecture and ML systems, with an emphasis on improving the performance and energy efficiency of modern generative AI workloads. My research focuses on hardware-software co-design for efficient machine learning, including novel ML accelerator architectures and edge inference serving.

Previously, I completed my BA in Computer Science at UC Berkeley, where I was an undergraduate researcher at the SLICE Lab, advised by Prof. John Wawrzynek and Prof. Krste Asanović. I also spent two summers interning at Apple on the CPU Performance and GPU RTL Design teams.

Current Research

Harvard Architecture, Circuits and Compilers

Working on enabling LLM inference on a novel distributed multi-chiplet transformer accelerator by writing custom ISA kernels. Profiling the power and performance of end-to-end LLM serving pipelines on resource-constrained edge devices.

Selected Projects

uArchDB (Iris)

uArchDB is an extensible, graph-based microarchitecture event logging API for profiling and debugging open-source processors in RTL. uArchDB is built for the Chipyard hardware design framework and enables users to generate easy-to-understand waterfall visualizations from their designs. We presented this work at the 2024 SLICE summer research retreat. Check out our poster and presentation.

uArchDB on Gemmini

As part of EE290 Hardware for Machine Learning, we used uArchDB to annotate Gemmini, a systolic-array-based DNN accelerator. We annotated the fine-grained instruction FSMs, reservation station, systolic array, scratchpad, and accumulator. We also implemented Gemmini instruction decoding in the backend Python script for readability in the waterfall viewer. With uArchDB, we could easily visualize the interactions between long-latency memory and compute operations in the accelerator. Check out our report.

RISC-V Vector SoC Tapeout on Intel16

As part of EE194 Tapeout, we implemented Saturn, a microarchitecture compliant with the RISC-V Vector 1.0 specification, on a 2x2 mm Intel16 SoC. We implemented 4 in-order Saturn cores with 256-bit vector length register files and 4 KB L1 ICaches and DCaches. We targeted a minimal-area configuration optimized for integer operations in ML inference and DSP applications. Our SoC also includes a 1D torus NoC, a 256 KB L2 cache, convolution and FFT accelerators, a DMA engine, and 8-channel audio. The chip was accepted by Intel in May 2024 and has arrived for bring-up. Check out our poster and presentation.

Publications

High-Throughput SAT Sampling [Paper]
Arash Ardakani, Minwoo Kang, Kevin He, Qijing Huang, John Wawrzynek
DATE 2025

DEMOTIC: A Differentiable Sampler for Multi-Level Digital Circuits [Paper]
Arash Ardakani, Minwoo Kang, Kevin He, Qijing Huang, Vighnesh Iyer, Suhong Moon, John Wawrzynek
ASP-DAC 2025

Late Breaking Results: Differential and Massively Parallel Sampling of SAT Formulas [Paper]
Arash Ardakani, Minwoo Kang, Kevin He, Vighnesh Iyer, Suhong Moon, John Wawrzynek
DAC 2024