Alexey Elkin

Backend engineer who's been in the weeds at two NYC startups. I like hard infrastructure problems. Currently at Terra (YC W24).
Cornell University, B.S. Computer Science, '25.

Experience
Software Engineer · Terra
New York, NY · January 2026 – Present
  • Owned and shipped a mobile-first merchant platform (inventory, pricing, and fulfillment routing between the US and China) supporting 100+ merchants in live production.
  • Built an AI-powered catalog ingestion pipeline (GCP Cloud Run, Claude API) converting messy supplier spreadsheets into structured product data, automating onboarding of 1000+ SKUs.
  • Fixed payment inconsistencies between Stripe and PostgreSQL by building a reconciliation system with Slack alerting, preventing revenue leakage across $30M+ in transactions.
  • Designed a real-time shipping pricing system (TypeScript, Node.js, PostgreSQL) with carrier integrations and caching, enabling dynamic checkout and per-product shipping logic.
Software Engineer · Refine
New York, NY · June 2025 – November 2025
  • Re-architected distributed analytics pipeline (Go, Redis, ClickHouse), cutting API latency 85% (300ms → 50ms) for workloads handling millions of search events.
  • Built production monitoring and alerting infrastructure (GCP, Cloud Monitoring), reducing incident response time 70% and maintaining 99.9% uptime for mission-critical AI systems.
  • Architected CI/CD and Kubernetes deployment workflows (GitLab, GKE), consolidating 12 microservices into a monorepo with automated rollouts in under 5 minutes.
  • Developed high-throughput analytics interfaces (React/TypeScript) supporting 50K+ monthly queries for enterprise decision-making.
Research & Teaching · Cornell University
Ithaca, NY · 2022 – 2024
Research

Built Python pipelines to analyze meme-based sentiment during the Ukraine war, scraping 5,000+ Reddit posts. Published by the Brookings Institution.

Teaching Assistant

Led labs and office hours for 40+ students in Intermediate Web Development (CS 2300). Taught SQL indexing, partitioning, and normalization.

Cornell Brooks TPI · Future Leaders Spotlight: AI & Democracy

Lavendr Software · SWE Intern

Remote · Summer 2024 · Rails backend, RSpec test coverage to 90%, PostgreSQL optimization.

Jade Studios · SWE Intern

Ithaca, NY · Spring 2024 · Led team of 8 building a 2D Java game; built UI framework and level editor.

Projects
Fault-Tolerant Idempotent Job Pipeline · In Progress
TypeScript · Node.js · GCP Cloud Run · Pub/Sub · PostgreSQL · Redis

This came out of a problem I kept running into at Terra. We process a lot of messy Excel and CSV files from suppliers, turning them into structured product data. The pipeline that handles this would occasionally fail partway through a file, and there was no clean way to retry without risking duplicate rows or half-written data. I wanted to build a standalone system that solves this correctly from the ground up.

The core flow is: a spreadsheet gets uploaded to GCS, that triggers an event into Pub/Sub, and workers on Cloud Run pick up the file, parse it, validate the rows, and write structured output to PostgreSQL. Every job gets a deterministic idempotency key based on the file contents. Before writing anything, the worker checks a Redis dedup store to see if that file has already been processed. If a worker crashes mid-parse, the message goes back to the queue and gets retried safely. Same file, same key, same result. Jobs that fail after max retries go to a dead letter queue.
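
The key derivation and dedup check can be sketched in a few lines. This is a minimal illustration, not the production code: `idempotencyKey` and `claimJob` are made-up names, and an in-memory Set stands in for the Redis dedup store (which would use an atomic SET-if-not-exists in practice).

```typescript
import { createHash } from "node:crypto";

// Derive a deterministic idempotency key from the raw file bytes: the
// same spreadsheet always maps to the same key, so a redelivered or
// replayed message for the same file is recognized as a duplicate.
function idempotencyKey(fileContents: Buffer): string {
  return "job:" + createHash("sha256").update(fileContents).digest("hex");
}

// In-memory stand-in for the Redis dedup store.
const processed = new Set<string>();

// Returns true if this worker claimed the job, false if already done.
function claimJob(key: string): boolean {
  if (processed.has(key)) return false;
  processed.add(key);
  return true;
}
```

Because the key depends only on the file contents, a crashed-and-retried delivery produces the same key and is safely skipped once the first attempt has committed.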

The piece I'm most focused on right now is the replay system. The goal is to be able to resubmit any failed job or entire batch without worrying about side effects. Most pipelines treat reprocessing as an afterthought, but I want it to be a first-class thing. I'm also adding structured logging, per-job tracing through the full lifecycle, and basic metrics for success rates and latency.

Once the core is solid I'm planning to add failure simulation: crash workers mid-job, inject duplicate requests, force partial writes, and verify the system produces the correct output every time regardless.

Writing
Building a $30M+ Payment Reconciliation System · March 2026
Terra · Stripe · PostgreSQL

Our payment infrastructure had no way to detect when Stripe and our database fell out of sync. Stripe processes a payment, we write the result to our database, and if that second step fails for any reason there's no error, no alert, nothing. The two systems just quietly disagree. I noticed this while reviewing our payment flows and started mapping out where it could happen.

I found four places, each an operation that touched Stripe and our database in sequence with nothing watching what happened in between. A refund could process on Stripe's side and never land in our records. A payment could go through but fail to create an order on our end. Creator earnings could simply never be calculated if a specific payment event was missed. And a payout could be marked as completed in our system before the actual bank transfer was attempted, leaving money stuck in a permanent "completed" state that never actually moved. None of it produced a visible error.

Patching each flow individually would have been the wrong call. The real problem is structural: you can't make two external systems agree at the exact moment of writing. The right answer is to accept that inconsistencies will occasionally happen and catch them after the fact.

I built a background job that runs on a schedule, looks back over the last 24 hours, and compares what Stripe says happened against what our database recorded, across payments, refunds, invoices, fees, and payouts. Any gap triggers an immediate Slack alert. The job only detects and reports; it never silently corrects, because you want to understand a new failure mode before you automate a fix for it.
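
The comparison itself reduces to a set diff over the window. Here is a toy TypeScript sketch of that step only; `reconcile` and `PaymentRecord` are illustrative names, and fetching the two sides from Stripe and PostgreSQL, plus the Slack alerting, are left out.

```typescript
type PaymentRecord = { id: string; amountCents: number };

// Compare what Stripe reports against what our database recorded for
// the same window. Detection only: the caller decides how to alert.
function reconcile(stripe: PaymentRecord[], db: PaymentRecord[]) {
  const dbById = new Map<string, PaymentRecord>();
  for (const r of db) dbById.set(r.id, r);

  const missing: string[] = [];    // in Stripe, absent from our database
  const mismatched: string[] = []; // present in both, amounts disagree
  for (const s of stripe) {
    const d = dbById.get(s.id);
    if (!d) missing.push(s.id);
    else if (d.amountCents !== s.amountCents) mismatched.push(s.id);
  }
  return { missing, mismatched };
}
```

A real version also diffs in the other direction (records we have that Stripe doesn't), but the shape of the job is the same: two snapshots, one diff, one alert per gap.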

Cutting API Latency 85% with ClickHouse and Redis · June 2025
Refine · ClickHouse · Redis

Refine's product is a search engine, which means the database was ingesting a continuous stream of search events from customer sites around the clock. ClickHouse, which is built for analytical queries over large volumes of time-series data, was storing all of it. The analytics dashboard was reading from it directly, and the problem was that every single load was scanning the full event history to compute what to show. Roughly 50 million records on every request. You could see it in how long the page took to come up.

The first thing I changed was how the data was structured for querying. Instead of aggregating across the raw event history on every request, I set up materialized views that pre-aggregated the data by hour, day, and week as events came in. A query for "last 7 days" went from scanning tens of millions of rows to reading a few hundred pre-computed ones. That handled most of the latency.
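
As a toy illustration of the idea (not ClickHouse itself), here is a TypeScript sketch where an hourly counter map plays the role of the materialized view: ingest updates the rollup as events arrive, and the window query reads buckets instead of raw events. All names here are made up for the example.

```typescript
type SearchEvent = { tsMs: number };

const HOUR = 3_600_000;

// Hourly rollup maintained at ingest time, standing in for a
// pre-aggregated materialized view keyed by hour bucket.
const hourly = new Map<number, number>();

function ingest(e: SearchEvent): void {
  const bucket = Math.floor(e.tsMs / HOUR) * HOUR;
  hourly.set(bucket, (hourly.get(bucket) ?? 0) + 1);
}

// A time-window query sums a handful of counters rather than
// scanning the full event history.
function countBetween(fromMs: number, toMs: number): number {
  let total = 0;
  hourly.forEach((n, bucket) => {
    if (bucket >= fromMs && bucket < toMs) total += n;
  });
  return total;
}
```

The work moves from query time to ingest time: each event pays a tiny fixed cost on the way in, and the dashboard read becomes proportional to the number of buckets in the window, not the number of events.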

The second change was caching. The results of a dashboard query for a given time window don't change until new data arrives, so hitting ClickHouse on every load was unnecessary. I added a Redis cache in front of the queries so that repeated requests for the same window never touched ClickHouse at all. Between the two changes, response times for uncached requests came down from 300ms to around 50ms, and cached requests returned almost instantly.
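
The caching layer follows the standard cache-aside pattern. A minimal sketch, with an in-memory Map standing in for Redis and a placeholder in place of the real aggregate query (function names are illustrative):

```typescript
// Cache-aside: check the cache first, fall through to the database
// on a miss, then populate the cache for subsequent requests.
const cache = new Map<string, number>();
let dbHits = 0; // counts how often we actually query the database

function queryDb(window: string): number {
  dbHits++;
  return window.length * 7; // placeholder for the real aggregate query
}

function cachedQuery(window: string): number {
  const hit = cache.get(window);
  if (hit !== undefined) return hit;
  const result = queryDb(window);
  cache.set(window, result);
  return result;
}
```

In production the cache key also encodes the customer and query parameters, and entries are invalidated (or simply given a short TTL) as new events arrive for the window.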

Building a Promotions and Coupon System on Top of a Live Payment Stack · February 2026
Terra · Stripe · PostgreSQL

When I built the promotions system at Terra, the first question wasn't what features it needed. It was what a discount actually does to the payment stack. It changes what Stripe charges, what the creator gets paid, and how refunds are calculated if the order is returned. Getting any of those wrong means money ends up in the wrong place.

At Terra, creators sell products through their own storefronts. The core design question was who absorbs the cost of a discount. The answer we landed on was simple: the creator always takes the hit. If a creator issues a 20% off code, their payout is reduced by that amount. Terra's cut stays fixed. The Stripe charge reflects the discounted total, and the payout calculation works backwards from there.

The harder problem was what happens when a creator wants to discount beyond their margin. We allowed it, but Terra can't lose money in the process, so any order where the discount would have eaten into Terra's cut gets an automatic adjustment. The creator sees exactly what they're agreeing to before they publish the code.
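
The split described above might look roughly like this. The 10% platform fee, the fee being computed on the undiscounted price, and all function names are assumptions made for the example; the real rates and rounding rules aren't shown here.

```typescript
// Hypothetical fee rule: the platform takes a fixed percentage of the
// undiscounted list price, and the creator absorbs the discount.
const PLATFORM_FEE_RATE = 0.1;

function splitOrder(listPriceCents: number, discountCents: number) {
  const platformFee = Math.round(listPriceCents * PLATFORM_FEE_RATE);

  // Automatic adjustment: the discount may never eat into the platform
  // fee, so it is capped at whatever still leaves the fee covered.
  const maxDiscount = listPriceCents - platformFee;
  const appliedDiscount = Math.min(discountCents, maxDiscount);

  const charge = listPriceCents - appliedDiscount; // what Stripe charges
  const creatorPayout = charge - platformFee;      // works backwards from the charge
  return { charge, platformFee, creatorPayout, appliedDiscount };
}
```

The clamp is what "the creator always takes the hit" means in code: the platform fee comes out of the discounted charge first, and the creator's payout absorbs whatever the discount removed.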

Stacking was the other non-trivial piece. Creators could run sitewide promotions alongside individual coupon codes, so we had to make explicit decisions about when they could be combined and which took priority when multiple qualified at once.

One bug that came up in production: address updates at checkout were silently wiping the coupon discount from the order. The customer had seen the discounted total, but by the time the order confirmed the discount was gone. Once we identified the cause we fixed it and ran a backfill job to correct every affected order.

Skills
Core Go · TypeScript · Python · PostgreSQL · Redis · ClickHouse · gRPC
Infra GCP · Kubernetes · Docker · GitLab CI/CD · AWS · Stripe
Frontend React · Next.js · Node.js · HTML/CSS
Also Java · Ruby · C++ · Shell
Education
Cornell University
B.S. Computer Science · Class of 2025
Tonbridge School
A-Level & GCSE · Maths · Further Maths · Physics · Chemistry · 2016 – 2021
𝄞 Guitar & Piano