A Survey of Causal Inference Applications at Netflix

At Netflix, we want to entertain the world by creating engaging content and helping members discover titles they’ll love. The key to doing this is understanding the causal effects that connect the changes we make to the product to indicators of member joy.

To measure causal effects, we rely heavily on AB testing, but we also leverage quasi-experimentation in cases where AB testing is limited. Many Netflix scientists have contributed to how Netflix analyzes these causal effects.

To celebrate this impact and learn from each other, Netflix scientists recently came together for an internal summit on causal inference and experimentation. The week-long event brought together speakers from the content, product, and member experience teams to share methodological developments and applications in estimating causal effects. Talks covered a wide range of topics, including difference-in-differences estimation, double machine learning, Bayesian AB testing, and causal inference in recommender systems, among others.

We’re excited to share a glimpse of the event with you in this blog post through a selection of the talks, giving a behind-the-scenes look at our community and the scope of causal inference at Netflix. We look forward to connecting with you via a future external event and additional blog posts!

Incremental impact of localization

Yinghong Lan, Vinod Bakthavachalam, Lavanya Sharan, Marie Douriez, Bahar Azarnoush, Mason Kroll

At Netflix, we’re passionate about connecting our members with great stories that can come from anywhere and be loved everywhere. In fact, we stream in more than 30 languages and 190 countries, and we strive to localize, through subtitles and dubs, the content our members will enjoy the most. Understanding the heterogeneous incremental value of localization to member viewing is key to these efforts!

To estimate the incremental value of localization, we turned to causal inference methods using historical data. Running randomized experiments at scale presents both technical and operational challenges, particularly because we want to avoid denying localization to members who might need it to access the content they love.

Conceptual overview of using double machine learning to control for confounders and compare similar titles to estimate the incremental impact of localization

We analyzed data across languages and applied double machine learning methods to properly control for measured confounders. We investigated not only the impact of localization on overall title viewing, but also how localization adds value at different stages of the member journey. As a robustness check, we ran various simulations to assess the consistency and variance of our incrementality estimates. This information has played a key role in decisions to expand localization and delight our members worldwide.
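
For intuition, here is a minimal sketch of the double machine learning idea on synthetic data: two cross-fitted nuisance models partial the confounders out of the treatment and the outcome, and the slope of the residual-on-residual regression estimates the incremental effect. The features, models, and numbers below are hypothetical, not our actual pipeline.

```python
# Minimal double machine learning (DML) sketch on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                    # confounders, e.g. title popularity
T = (X[:, 0] + rng.normal(size=n) > 0) * 1.0   # treatment: title is localized
Y = 2.0 * T + X @ [1.0, 0.5, -0.5] + rng.normal(size=n)  # viewing outcome

# Stage 1: cross-fitted nuisance models partial the confounders out of
# both the treatment and the outcome.
t_hat = cross_val_predict(GradientBoostingRegressor(), X, T, cv=5)
y_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)

# Stage 2: regress outcome residuals on treatment residuals; the slope
# is the estimated incremental effect of localization (about 2 here).
effect = LinearRegression().fit((T - t_hat).reshape(-1, 1), Y - y_hat)
print(f"Estimated incremental effect: {effect.coef_[0]:.2f}")
```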

A related application of causal inference to localization arose when some dubbing was delayed by pandemic-related production studio closures. To understand the impact of these dubbing delays on title viewing, we used the synthetic control method to simulate what viewing would have been in the absence of delays. We compared the simulated viewing to observed viewing at title launch (when dubs were missing) and after title launch (when dubs were added).

To control for confounders, we ran placebo tests, repeating the analysis for titles that were unaffected by dubbing delays. In this way, we were able to estimate the incremental impact of delayed dub availability on member viewing for the affected titles. Should dubbing production pause again, this analysis allows our teams to make informed decisions about delays with greater confidence.
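
To make the approach concrete, here is a minimal sketch of a synthetic control estimate on made-up weekly viewing data; the donor titles, weights, and numbers are illustrative only, not the production analysis.

```python
# Minimal synthetic control sketch on made-up weekly viewing data.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
donors_pre = rng.random((8, 20))     # pre-launch weeks x unaffected titles
donors_post = rng.random((6, 20))    # post-launch weeks x unaffected titles
true_w = np.full(20, 0.05)
treated_pre = donors_pre @ true_w            # affected title, before launch
treated_post = donors_post @ true_w - 0.1    # observed dip while dub missing

# Fit non-negative weights so the donor pool tracks the affected title
# before launch, then project what viewing "would have been" afterwards.
weights, _ = nnls(donors_pre, treated_pre)
synthetic_post = donors_post @ weights

# The gap between observed and synthetic viewing estimates the impact
# of the missing dub (about -0.1 per week here).
print("Average weekly impact:", (treated_post - synthetic_post).mean())

# Placebo test: repeat the same fit for a title with no dubbing delay;
# its estimated "effect" should be close to zero.
```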

Holdback experiments for product innovation

Travis Brooks, Cassiano Coria, Greg Orties, Molly Jackman, Claire Lacner

At Netflix, there are many examples of holdback AB tests, which show some users an experience without a specific feature. These tests have dramatically improved the member experience by measuring the long-term effects of new features and by re-examining old assumptions. However, when holdback testing comes up, it can seem daunting in terms of experimental design and/or engineering cost.

Our goal was to share the best practices we’ve learned about designing and executing holdback tests, bringing clarity to holdbacks at Netflix so they can be used more widely across product innovation teams, by doing the following:

  1. Defining holdback types and their use cases, with past examples
  2. Suggesting future opportunities where holdback testing could be useful
  3. Listing the challenges of holdback testing
  4. Identifying future investments that could reduce the cost of deploying and maintaining holdbacks for product and engineering teams

Holdback tests have clear value in many product areas for confirming learnings, understanding long-term effects, retesting old assumptions on new members, and measuring cumulative value. They can also serve as a way to test product simplification by removing unused features, creating a more seamless user experience. In many areas of Netflix, they are already commonly used for these purposes.

Overview of how holdback testing works: we retain the current experience for a subset of members over the long term in order to gain valuable insights for improving the product
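
As an illustration, here is a minimal sketch of how a long-term holdback allocation might be implemented with stable hash-based bucketing; the function, test name, and percentage are hypothetical, not our actual allocation system.

```python
# Minimal sketch of stable hash-based holdback allocation.
import hashlib

HOLDBACK_PCT = 5  # hypothetical: keep 5% of members on the current experience

def in_holdback(member_id: str, test_name: str = "feature_x_holdback") -> bool:
    """Stable assignment: the same member always lands in the same bucket."""
    digest = hashlib.sha256(f"{test_name}:{member_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < HOLDBACK_PCT

# Members in the holdback keep the existing experience long term, so their
# metrics can be compared against members who received the new feature.
print("holdback" if in_holdback("member-123") else "new feature")
```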

We believe that by unifying best practices and providing easier tools, we can accelerate our learnings and create the best product experience for our members to access the content they love.

Causal Ranker: A Causal Adaptation Framework for Recommendation Models

Jeong Yoon Lee, Sudeep Das

Most machine learning algorithms used in personalization and search, including deep learning algorithms, are purely associative: they learn from correlations between features and outcomes how best to predict a target.

In many scenarios, going beyond this purely associative nature to understand the causal mechanism between taking a certain action and the resulting incremental outcome becomes key to decision making. Causal inference gives us a way to learn about such relationships and, when paired with machine learning, becomes a powerful tool that can be leveraged at scale.

Compared to machine learning, causal inference allows us to build a robust framework that controls for confounders to estimate the true incremental impact on members.

At Netflix, many surfaces today are powered by recommendation models, such as the personalized rows you see on your homepage. We believe many of these surfaces can benefit from an additional algorithm that aims to make each recommendation as useful to our members as possible, beyond simply identifying the title or feature a person is most likely to engage with. Adding this new model on top of existing systems can help make our recommendations right for the moment, helping members find the exact title they’re looking to stream right now.

This led us to create a framework that applies a lightweight causal adaptation layer on top of the core recommendation system, called the Causal Ranker Framework. The framework consists of several components: attribution of impressions (treatment) to plays (outcome), collection of true negative labels, causal estimation, offline evaluation, and model serving.

We are building this framework generically, with reusable components, so that any interested team within Netflix can adopt it for their use case, improving our recommendations throughout the product.
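
To illustrate the causal estimation component, here is a minimal sketch of a two-model (T-learner) uplift estimate on synthetic data; the features and the way impressions are simulated below are illustrative assumptions, not the production Causal Ranker.

```python
# Minimal two-model (T-learner) uplift sketch on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))             # member/title features (hypothetical)
impressed = rng.integers(0, 2, n)       # treatment: title shown on a row
base = 1 / (1 + np.exp(-X[:, 0]))       # baseline affinity to play
played = (rng.random(n) < base + 0.1 * impressed).astype(int)

# Fit one play model for impressed rows and one for the true negatives
# (eligible but not impressed).
m_treat = GradientBoostingClassifier().fit(X[impressed == 1], played[impressed == 1])
m_ctrl = GradientBoostingClassifier().fit(X[impressed == 0], played[impressed == 0])

# Uplift = P(play | impressed) - P(play | not impressed): the value the
# recommendation itself adds, not just the member's existing affinity.
uplift = m_treat.predict_proba(X)[:, 1] - m_ctrl.predict_proba(X)[:, 1]

# Re-rank candidates by estimated incremental value rather than by raw
# play probability.
ranking = np.argsort(-uplift)
```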

Bellmania: incremental account lifetime valuation at Netflix and its applications

Reza Badri, Allen Tran

Understanding the value of acquiring or retaining subscribers is crucial for any subscription business like Netflix. While customer lifetime value (LTV) is commonly used to value members, naive LTV metrics likely overstate the true value of acquisition or retention, since there is always a chance that potential members would join on their own in the future, without any intervention.

We establish a methodology and the necessary assumptions for estimating the monetary value of acquiring or retaining subscribers, based on a causal interpretation of incremental LTV. This requires us to estimate both on-Netflix and off-Netflix LTV.

To overcome the lack of data on non-members, we use a Markov chain-based approach that recovers off-Netflix LTV from minimal data on transitions between being a subscriber and not being one over time.

Using Markov chains, we can estimate the incremental value of a member over a non-member in a way that appropriately captures the value of potential future joins
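
For intuition, here is a minimal sketch of the Markov chain calculation: lifetime value satisfies a Bellman equation over subscriber states, and the incremental value of a member is the difference between state values. The transition probabilities, revenue, and discount factor below are made-up numbers, not Netflix estimates.

```python
# Minimal Markov chain LTV sketch: solve a Bellman equation for the
# expected discounted lifetime revenue of each state.
import numpy as np

states = ["subscriber", "non_subscriber"]
P = np.array([[0.95, 0.05],    # subscriber: stays vs. cancels (made-up)
              [0.02, 0.98]])   # non-subscriber: rejoins vs. stays away
r = np.array([15.0, 0.0])      # monthly revenue earned in each state
gamma = 0.99                   # monthly discount factor

# LTV satisfies v = r + gamma * P v, so v = (I - gamma * P)^{-1} r.
v = np.linalg.solve(np.eye(2) - gamma * P, r)
ltv_sub, ltv_non = v

# The incremental value of retaining a member nets out the chance that
# a non-member would have rejoined on their own.
print(f"Incremental LTV: {ltv_sub - ltv_non:.2f}")
```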

Additionally, we demonstrate how this methodology can be used to (1) forecast the total number of subscribers in a way that respects both addressable-market constraints and account-level dynamics, (2) estimate the impact of pricing changes on revenue and subscription growth, and (3) provide optimal policies, such as price discounts, that maximize expected lifetime revenue from members.

Measuring causal effects is a big part of the data science culture at Netflix, and we’re proud to have so many amazing colleagues who leverage both experimentation and quasi-experimentation to generate impact for members. The summit was a great way to celebrate everyone’s work and highlight the ways causal methodology can create business value.

We look forward to sharing more about our work with the community in future articles. To stay up to date on our work, follow the Netflix Tech Blog, and if you’re interested in joining us, we’re currently looking for amazing new colleagues to help us entertain the world!

