On 2023-06-22 I attended the NYC MLOps Community meetup at Spotify. Here are some photos and notes from the talks:

David Xia: How to Build a Frictionless ML Platform

Spotify uses Ray.
End-user teams manage their own Ray clusters on multi-tenant infra.
Hendrix is a (beta) internal Spotify tool to save folks from needing to learn K8s.
How to solve local development problems?
- Cloud developer environment (CDE).
- They built something like GitHub Codespaces, but with Ray and GPUs
- VSCode in the browser: Works the same on any device
- Everything runs as workloads on K8s
- They use Istio to get routing done right
Lessons Learned:
- Must be HA: Don’t have a reverse proxy that is a SPOF (single point of failure)
- Needs to be customizable and extensible
- Needs telemetry to show that the CDE actually makes people more productive
- Use K8s etcd as your DB
- Use K8s operator with CRD, it’s neat

Ryan Culbertson: Near Real-Time Features w/ Jukebox NRT

Ryan:
- Spotify senior engineer
- ML Infra
- Feature Mgmt tooling
Ideally, the Spotify app makes personalized recommendations right after the user does stuff
- Therefore, near real-time is desirable
Cold-start problem is particularly challenging
Near Real-Time (NRT) here just means streaming data that gets processed within minutes, not seconds
(Spotify runs all its stuff on Google Cloud)
Scale: Operating on the order of 3M messages/sec, with high-cardinality data too
Their NRT tool Jukebox uses Flink, Bigtable
- Idea is to use SQL for experimentation and also prod
- Jukebox operates in 5-min windows of aggregation
  - Small window: Fresh data
  - Big window: Cheap data
  - Had to find some arbitrary sweet spot
- 5-minute window aggregations get re-aggregated on reads later
- Some teams might want a 1-hour window, some might want 6-hour, etc.
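
The re-aggregation idea can be sketched roughly as follows. This is a minimal plain-Python illustration, not Jukebox's actual implementation (which, per the talk, is built on Flink and Bigtable). The names `WINDOW_SECONDS`, `aggregate_into_windows`, and `read_feature`, the sample events, and the choice of a simple count as the aggregate are all my own assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows, as mentioned in the talk


def aggregate_into_windows(events):
    """Write path: count events per (key, 5-minute window start).

    `events` is an iterable of (key, unix_timestamp_seconds) pairs.
    """
    counts = defaultdict(int)
    for key, ts in events:
        window_start = ts - (ts % WINDOW_SECONDS)
        counts[(key, window_start)] += 1
    return dict(counts)


def read_feature(window_counts, key, now, lookback_seconds):
    """Read path: re-aggregate stored 5-minute counts over a longer lookback.

    Sums every window for `key` whose start falls in [now - lookback_seconds, now).
    """
    earliest = now - lookback_seconds
    return sum(
        count
        for (k, window_start), count in window_counts.items()
        if k == key and earliest <= window_start < now
    )


# Hypothetical events: (user id, timestamp in seconds)
events = [
    ("user_a", 10),
    ("user_a", 200),
    ("user_a", 400),
    ("user_a", 4000),
    ("user_b", 50),
]
stored = aggregate_into_windows(events)

# The same stored 5-minute aggregates serve teams with different lookbacks.
one_hour = read_feature(stored, "user_a", now=7200, lookback_seconds=3600)
six_hours = read_feature(stored, "user_a", now=7200, lookback_seconds=6 * 3600)
```

The point of the design, as I understood it, is that the small windows are computed once on the write side, and each consumer picks its own lookback on the read side without needing a separate streaming job.
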
Lessons Learned:
- Flink integration w/ GCloud needed custom connectors! Probably smoother with the more mature Kafka
- Instrument everything to help find bottlenecks (in multi-thread stuff, e.g.)
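
As a rough illustration of the instrumentation point, here is a minimal thread-safe timing sketch in Python. It is not what Spotify uses (the talk did not go into that level of detail); the `timed` and `summary` helpers, the stage names, and the sleeping `worker` are hypothetical, just to show how per-stage timings from multiple threads can be collected to spot a slow stage.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

_lock = threading.Lock()
_timings = defaultdict(list)  # stage name -> list of durations in seconds


@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        with _lock:  # many threads may record at the same time
            _timings[stage].append(elapsed)


def summary():
    """Return {stage: (call_count, total_seconds)}."""
    with _lock:
        return {stage: (len(d), sum(d)) for stage, d in _timings.items()}


def worker():
    # Hypothetical pipeline stages; sleep stands in for real work.
    with timed("parse"):
        time.sleep(0.001)
    with timed("write"):
        time.sleep(0.002)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

report = summary()
```

Comparing the totals per stage in `report` gives a first hint about where the time goes. A real setup would more likely export these numbers to a metrics system than keep them in a dict.
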