
Spotify MLOps Meetup Notes

On 2023-06-22 I attended the NYC MLOps Community meetup at Spotify. Here are some photos and notes from the talks:


David Xia: How to Build a Frictionless ML Platform

  • Spotify uses Ray.
  • End-user teams manage their own Ray clusters on multi-tenant infra.
  • Hendrix is a (beta) internal Spotify tool that saves folks from needing to learn K8s.
  • How to solve local development problems?
    • Cloud developer environment (CDE).
    • They built something like GitHub Codespaces, but with Ray and GPUs
    • VS Code in the browser: works the same on any device
    • Everything runs as workloads on K8s
    • They use Istio to get routing done right
  • Lessons Learned:
    • Must be HA (highly available): don’t have a reverse proxy that is a SPOF (single point of failure)
    • Needs to be customizable and extensible
    • Needs telemetry to show that the CDE actually makes people more productive
    • Use K8s etcd as your DB
    • Use a K8s operator with a CRD (custom resource definition); it’s neat
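
For anyone unfamiliar with the operator pattern, here is a minimal, purely in-memory sketch of the reconcile idea behind it: compare the desired state (what custom resources declare) with the observed state, and emit actions that close the gap. This is not Spotify's implementation. `WorkspaceSpec`, `reconcile`, and `apply_actions` are hypothetical names made up for illustration, and a real operator would watch the Kubernetes API instead of comparing dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkspaceSpec:
    """Hypothetical custom resource spec for a cloud dev environment."""
    owner: str
    gpus: int


def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_actions(actions: list, desired: dict, observed: dict) -> dict:
    """Apply actions to a copy of the observed state (stand-in for real API calls)."""
    new_state = dict(observed)
    for verb, name in actions:
        if verb == "delete":
            new_state.pop(name)
        else:  # "create" and "update" both converge on the desired spec
            new_state[name] = desired[name]
    return new_state


desired = {"alice-ws": WorkspaceSpec("alice", 1), "bob-ws": WorkspaceSpec("bob", 0)}
observed = {"bob-ws": WorkspaceSpec("bob", 2), "old-ws": WorkspaceSpec("carol", 0)}

actions = reconcile(desired, observed)
converged = apply_actions(actions, desired, observed)
```

Running `reconcile` again on the converged state yields no actions, which is the property that makes the loop safe to repeat.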

Ryan Culbertson: Near Real-Time Features w/ Jukebox NRT

  • Ryan:
    • Spotify senior engineer
    • ML Infra
    • Feature Mgmt tooling
  • Ideally, the Spotify app makes personalized recommendations right after the user does stuff
    • Therefore, near real-time is desirable
  • Cold-start problem is particularly challenging
  • Near Real-Time (NRT) here means streaming data that gets processed within minutes rather than seconds
  • (Spotify runs all its stuff on Google Cloud)
  • Scale: on the order of 3M messages/sec, with high-cardinality data too
  • Their NRT tool Jukebox uses Flink and Bigtable
    • The idea is to use SQL for both experimentation and prod
    • Jukebox operates in 5-min windows of aggregation
      • Small window: Fresh data
      • Big window: Cheap data
      • Had to find some arbitrary sweet spot
    • 5-minute window aggregations get re-aggregated on reads later
    • Some teams might want a 1-hour window, some might want 6-hour, etc.
  • Lessons Learned:
    • Flink integration w/ GCloud needed custom connectors! Probably smoother with the more mature Kafka integration
    • Instrument everything to help find bottlenecks (e.g., in multi-threaded code)
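
To make the windowing idea from the Jukebox notes concrete, here is a rough sketch of 5-minute tumbling-window counts that get re-aggregated at read time into whatever lookback a team wants (1 hour, 6 hours, etc.). It is plain Python with no Flink or Bigtable involved, and it is not how Jukebox is actually implemented. The function names, the event shape `(key, unix_ts)`, and the use of simple counts are all assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows, as mentioned in the talk


def window_start(ts: int) -> int:
    """Round a Unix timestamp down to the start of its 5-minute window."""
    return ts - ts % WINDOW_SECONDS


def aggregate(events) -> dict:
    """Write path: count events per (key, window start).

    `events` is an iterable of (key, unix_ts) pairs (assumed shape).
    """
    counts = defaultdict(int)
    for key, ts in events:
        counts[(key, window_start(ts))] += 1
    return dict(counts)


def read_feature(store: dict, key: str, now: int, lookback_seconds: int) -> int:
    """Read path: sum the stored 5-minute counts covering the trailing lookback.

    The current window plus the preceding ones are included, so a 1-hour
    lookback sums at most 12 windows.
    """
    last = window_start(now)
    first = last - lookback_seconds + WINDOW_SECONDS
    return sum(
        count
        for (k, start), count in store.items()
        if k == key and first <= start <= last
    )


events = [("user_1", 0), ("user_1", 100), ("user_1", 400), ("user_1", 3700), ("user_2", 50)]
store = aggregate(events)

plays_1h = read_feature(store, "user_1", now=3700, lookback_seconds=3600)
plays_6h = read_feature(store, "user_1", now=3700, lookback_seconds=6 * 3600)
```

The trade-off from the notes shows up here: a smaller `WINDOW_SECONDS` gives fresher data but more rows to store and sum per read, while a bigger one is cheaper but staler.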