Multivariate A/B Testing with Dependent Experiments

This is a continuation of my previous post on implementing a generic A/B testing service – here I will discuss one of the biggest problems we faced: supporting multiple simultaneous experiments, some of which may influence the outcomes of others. The solution discussed here is based on Google’s approach to A/B testing.

Allowing a user to be in more than one experiment
One of the first issues to address is when it is acceptable for a user to be in two experiments at once. Of course, if we have several experiments testing the same thing (e.g. in the advertisement use-case, where there are several statistical models and we want to find out which one yields the highest click-through rate), then any given user can only be in one of those experiments. On the other hand, if we have two unrelated experiments which are testing completely independent features, then it’s fine for a user to be in both experiments at the same time.

We need a way to define and represent experiments that captures the different semantics of similar and independent experiments. For this, we borrow Google’s terminology:

  • Layer – A layer represents a specific type of A/B test. For example, we may have a layer for representing the A/B testing of statistical models used for boosting click-through rates.
  • Experiment – An experiment lives within a parent layer and each experiment represents a specific test. For example, we may have an experiment for testing model #5.

These terms are depicted more clearly in the diagram below, which shows an A/B testing configuration for two independent features.

Example depicting A/B testing terminology

Now, it becomes clear when a user can participate in multiple experiments. Within each layer, a user can only be in one experiment but across layers, this constraint does not apply. Now we just need a way to assign users to an experiment within each layer.

Ensuring that traffic is always allocated fairly

When users are being assigned to multiple experiments, it is easy to allocate traffic (i.e. map users to experiments) in a way that is unfair and leads to biased, meaningless results. Consider our two independent features scenario and imagine that users are distributed horizontally across the diagram:

Unfair user allocation for two independent features

Here, we can see that if a user is assigned the experiment Feature A: On, they will also be assigned Feature B: On. So, if we later analyse the KPIs for the purple users vs the red users, we will have no idea whether any change in the KPIs has come from Feature A or Feature B. We need to allocate traffic in a fair way:

Fair user allocation for two independent features

Here, the users assigned Feature A: On are equally split over Feature B: On and Feature B: Off. Now, if we look at the KPIs for the purple users vs the red users, we know that any change we see is purely down to Feature B. We have achieved this by partitioning experiments within each layer equally across experiments within the other layer.

In this simple example, where we only have two layers and two experiments within each layer, with a 50/50 traffic split, the partitioning solution is simple. However, our generic service needed to be able to handle scenarios where there are potentially many layers and many experiments within each layer, with arbitrary traffic splits. To solve this, we took another approach. Instead of ensuring experiments are partitioned fairly and then assigning a user to a fixed position across all layers, we randomly assign users to a position within each layer. To build the logic behind this, we introduce a new concept:

  • Bucket – A layer is split into many buckets and a bucket maps to a single experiment. This is used in traffic allocation: a user is assigned a bucket and that bucket determines which experiment they land in.

As long as users are uniformly distributed across the buckets in each layer, and there is no dependence between the assignments in one layer and the assignments in another, we will end up with a fair allocation but with much simpler logic than the experiment partitioning approach would require. We must also ensure that any given user is always assigned to the same bucket within each layer – we don’t want users to end up in different experiments on subsequent requests. Effectively, we need a random but deterministic mapping from users to buckets which is independent between layers.

To achieve this, we concatenate the user’s ID with each layer ID, and then use a hash function to map the user to a bucket within each layer:
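
To make this concrete, here is a minimal sketch of the mapping, assuming 1,024 buckets per layer and an MD5 hash – both the bucket count and the choice of hash function are illustrative rather than a description of the real service:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BucketAssigner {

    // Number of buckets per layer; a power of two, so the modulo reduces to a bitwise mask.
    private static final int NUM_BUCKETS = 1024;

    // Deterministically maps a user to a bucket within a layer by hashing
    // the concatenation of the user ID and the layer ID.
    public static int bucketFor(String userId, String layerId) {
        byte[] digest = md5(userId + ":" + layerId);
        // Build a non-negative int from the first four bytes of the digest.
        int hash = ((digest[0] & 0x7F) << 24)
                 | ((digest[1] & 0xFF) << 16)
                 | ((digest[2] & 0xFF) << 8)
                 |  (digest[3] & 0xFF);
        return hash & (NUM_BUCKETS - 1);
    }

    private static byte[] md5(String input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // The same user always lands in the same bucket within a layer,
        // but its buckets in different layers are independent of each other.
        System.out.println(bucketFor("user-42", "feature-a-layer"));
        System.out.println(bucketFor("user-42", "feature-b-layer"));
    }
}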

As long as we use a good hash function and choose a power of two for our number of buckets (see this stackoverflow post for an explanation), this will give the random but deterministic mapping that we need. For four users, we may end up with a mapping that looks something like this:

Bucket allocation for two independent features, by hashing user_id and layer_id together

Configuring experiments with eligibility criteria

Another requirement for our A/B testing framework was to let experiments have eligibility criteria which cause the experiment to be active only under certain conditions. This means associating a set of attributes with the experiment. For example, the attributes {“country”: “UK”, “gender”: “male”} indicate that a user is only eligible for an experiment if they are male and live in the UK. If a user lands in an experiment that they are not eligible for, we give them the ‘default experience’ and, to avoid introducing bias, that user’s data is discarded and not considered when calculating KPIs.
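
As an illustration, a user can be considered eligible only if every attribute in the experiment’s criteria matches the corresponding user attribute. A minimal sketch of such a check, with hypothetical attribute names:

import java.util.Map;

public class EligibilityChecker {

    // A user is eligible for an experiment only if every attribute in the
    // experiment's criteria is present in the user's attributes with the same value.
    public static boolean isEligible(Map<String, String> criteria, Map<String, String> userAttributes) {
        for (Map.Entry<String, String> criterion : criteria.entrySet()) {
            if (!criterion.getValue().equals(userAttributes.get(criterion.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> criteria = Map.of("country", "UK", "gender", "male");
        System.out.println(isEligible(criteria, Map.of("country", "UK", "gender", "male"))); // true
        System.out.println(isEligible(criteria, Map.of("country", "DK", "gender", "male"))); // false
    }
}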

When using the layer model with eligibility criteria, there are situations where we can introduce new layers to minimise the amount of data discarded. For example, say that we want to test Feature A only in the UK and Denmark. We could define a layer with four experiments, like this:

Eligibility criteria – a poorly defined configuration

This works and adheres to our original definition of a layer; however, this configuration results in some traffic being discarded unnecessarily. For example, if a UK user lands in a Denmark experiment, they will be given the default experience but will not participate in an experiment. This is clearly not optimal – only 50% of UK and DK users will participate in an A/B test. To avoid this wastage, we can simply use two layers instead of one:

Eligibility criteria – a well defined configuration

Defining multiple layers in this way is acceptable whenever the sets of users eligible for the experiments in each layer are disjoint. If this were not the case in the above example, then we could end up with a conflict: a user could be assigned ON in one layer and OFF in another. If we wanted to add another A/B test, testing Feature A on all male users, then we would have to define all the experiments in a single layer, as there is some overlap in eligible users.


As we’ve seen, A/B testing can become fairly complex when running multiple simultaneous experiments, and there are some subtle issues that must be addressed to ensure that experiments are fair, meaningful and optimally allocated so as not to waste valuable traffic. For a tech company, A/B testing has enough complexity to warrant building it as an isolated service, enabling you to have powerful A/B testing functionality across a range of applications without needing to solve the same problems in multiple projects.

Implementing a Generic A/B Testing Service

The ability to do A/B testing has become a key requirement of many tech driven companies. Without the ability to run meaningful A/B tests, how can we know for sure that changes and new features are resulting in the improvements that we anticipate? Here are a few examples of how A/B testing is used:

  • A social network, changing the number of required fields on their registration webpage and monitoring how it affects the percentage of visitors who register
  • An e-retailer, making a change to the way products are ordered by an e-commerce search engine and monitoring how it affects the number of product orders attributed to search
  • An advertiser, comparing statistical models used in placing advertisements and seeing which model leads to the highest click-through rate

In fact, any product which involves user interaction can be A/B tested to ensure that changes or new features are having a positive effect on relevant key performance indicators (KPIs).

In this post, I will discuss the motivation and benefits of implementing A/B testing as a generic service, to be shared and used throughout an organisation.

Consider the simplest possible implementation of A/B testing. This typically involves splitting the users of an application into two equal halves and then giving each of the two user groups a different experience. This could be achieved by adding some simple logic inside your application:
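
In Java, for example, the hard-coded split might look something like the sketch below (illustrative only, not the exact code of any particular application):

public class SimpleAbTest {

    // Splits users 50/50 by taking the hash code of the user ID modulo 2:
    // even hash codes get experience A, odd hash codes get experience B.
    public static String experienceFor(String userId) {
        int hash = userId.hashCode() & 0x7FFFFFFF; // mask the sign bit so the value is non-negative
        return (hash % 2 == 0) ? "A" : "B";
    }

    public static void main(String[] args) {
        System.out.println(experienceFor("user-123"));
        System.out.println(experienceFor("user-456"));
    }
}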

Here, we take the integer hash code of a user’s ID and apply modulo 2. Then, if we get an even number, give the user experience A and if we get an odd number, give the user experience B. We can store or log the experience given to each user and later compute some KPIs for the ‘odd’ users and the ‘even’ users and compare the two user groups.

This implementation should work fine, but in practice, there are a number of other things that are usually required from an A/B testing framework:

  1. Be able to disable a live experiment, or change the traffic allocation.
    For example, you might want to give just 1% of users a new feature and verify that it’s working as expected before ramping the traffic allocation up to a 50/50 test. In our simple implementation, the traffic allocation is hard-coded into the application, so we’d have to redeploy every time we make a change. Of course, we could use a database to store experiment configuration and make it possible to change the configuration while the application is running, but this is a lot more work to implement and requires you to maintain a database.

    Another concern here is around how you later perform your KPI analysis. Imagine a traffic allocation change being made to a live A/B test, at a certain point in time:

    Scenario where a traffic allocation change is made to a live A/B test.

    When you perform KPI analysis to compare your experiments, it’s acceptable to perform the analysis before the change was made, or after the change was made, but not across the time of the change. This generally applies to any type of change – changes will affect your KPIs in some way and your results will not be meaningful if you analyse your experiments across the time when a change was made. It would be good if our A/B testing logic had some way of enforcing meaningful analysis.

  2. Have experiments with multiple parameters.
    In the example above, there is just one parameter: a boolean indicating whether the new feature should be turned on or off. But it’s possible that you might want an experiment with many parameters, and these parameters may have different types. Hard-coding these parameters into your application will quickly become messy and hard to organise.

  3. Have multiple experiments running at the same time.
    Things can get tricky when you want to run multiple experiments at the same time. When is it OK for a user to be in two experiments at once? Of course, if you have several experiments testing the same thing (such as in the advertiser use case mentioned earlier, where there are several statistical models and you want to find which one yields the highest click-through rate), then a user can only be in one of those experiments. On the other hand, if you have two unrelated experiments which are testing completely independent features, then it’s fine for a user to be in both experiments at the same time.

    There are further complexities around how to allocate traffic fairly when users are being assigned to multiple experiments. Consider the case where we are A/B testing two new, independent features. For each feature, we want to find out whether it’s better to have that feature enabled or disabled. The scenario is depicted in the below image:

    A/B testing two independent experiments simultaneously, with an unfair traffic allocation scheme.

    Here, if you imagine that users are distributed horizontally across the diagram, then drawing a vertical line anywhere on the diagram would represent a single user’s experience. If a user is assigned the experiment Feature A: On, they will also be assigned Feature B: On. So, if we later analyse the KPIs for the purple users vs the red users, we will have no idea whether any change in the KPIs has come from Feature A or Feature B. We need to allocate traffic in a fair way:

    A/B testing two independent experiments simultaneously, with a fair traffic allocation scheme.

    Here, the users assigned Feature A: On are equally split over Feature B: On and Feature B: Off. Now, if we look at the KPIs for the purple users vs the red users, we know that any change we see is purely down to Feature B.

    In this example, the solution is quite obvious, but what if we are testing many more features simultaneously, some of which have more than two variations? We need a way to ensure traffic is always allocated fairly, in any situation.

  4. Introduce segmentation logic.
    This is where you want your experiments to behave differently depending on the situation. For example, you might want an experiment to only be applicable to users from a certain country. Again, if you are writing your own A/B testing logic, this is more work for you to implement in your application.

  5. Validate experiments before they go live.
    In particular, this applies in the case where your experiments have many parameters of different types. There’s always the possibility of human error when you are configuring or making changes to an experiment. It would be nice to have a way to validate experiment configuration and ensure we are setting the right parameters with the correct types.

Implementing all of these features is a lot of work – and certainly not something you want to repeat in every application you develop that has some form of A/B testing requirement. A nice solution is to build an A/B testing service which handles things like storing experiment configuration, validation, segmentation, running multiple experiments simultaneously and allocating traffic fairly. Such a service could be used in two ways:

  1. By people: to define, configure and manage experiments
  2. By an application: to retrieve a set of parameters for handling a specific request

This can be hugely beneficial for anyone building an application that wants to do A/B testing. Instead of implementing a load of A/B testing logic, the application just needs to make a single call to a service – effectively asking the question ‘what experience should I give this user?’. With this design, a client application just passes a userId (or some other data element used to split traffic into experiments) and a set of attributes which may be used for segmentation (for example, country=denmark might be an attribute). The client then receives a response with a set of parameters which it can use to determine what experience to give the user. Using our earlier example, the response parameters would consist of something like featureEnabled=true. No other A/B testing logic needs to be implemented in the application itself!
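
To make the contract concrete, a client call might look roughly like the sketch below – the endpoint, parameter names and response format are hypothetical, not the actual API of any particular service:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AbTestingClient {

    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Asks the A/B testing service which experience to give a user. The service
    // handles traffic splitting and segmentation; the client only passes a user ID
    // and whatever attributes segmentation may need.
    public static String getExperimentParameters(String userId, String country) throws Exception {
        URI uri = URI.create("http://abtesting-service/experiments"
                + "?userId=" + userId
                + "&country=" + country);
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // e.g. {"featureEnabled": true}
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getExperimentParameters("user-123", "denmark"));
    }
}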

Having a generic A/B testing service clearly brings many benefits when you have multiple applications that need this type of functionality. But, there is a notable drawback to the scheme described above which I came across when building an A/B testing service to be used in my organisation. This design assumes that it is feasible to make a request to a separate service for every request processed by the application. There are two potential issues with this assumption: the first being that in super-low latency applications, adding a couple of milliseconds to the response time may be unacceptable, and the second being that in high-load applications, the volume of requests may be too high for it to be feasible to call an external service on every request.

We hit the second issue: we have an application which handles hundreds of thousands of requests per second, and it wasn’t feasible to give the A/B testing service enough resources to handle that kind of volume. Our solution was to allow this application to download experiment traffic allocation instructions periodically from our service and implement some additional A/B testing logic in the application. This solution is not as clean as the call-on-every-request approach, as it requires more work in the client application, such as segmentation logic, but the application still benefits from the service handling configuration storage, validation and traffic allocation.

For a more detailed description of the logic behind handling multiple simultaneous experiments, check out my next post (originally published on the Adform engineering blog). I also recommend reading this paper from Google, which the ideas in this post are based on.

Real-Time Analytics with Elasticsearch

When you are running a website used by thousands of people, it should go without saying that very valuable data can be collected about your users and the way they interact with your site. This usually comes in the form of click-stream data – effectively the logs written by your web server on every interaction (or click) made by a user navigating your site. An e-retailer can use this data for many purposes, for example: analysing how users interact with the site and what causes them to place an order, monitoring the performance of a new advertising campaign, seeing how users respond to a new feature and detecting problems such as pages which return a 404 error. For many of these purposes, raw data can be crunched through a batch processing system such as Hadoop, but for some purposes it is beneficial to have real-time information and statistics about how users are interacting with the site right now.

Imagine you are an e-retailer in the process of rolling out a brand new feature, such as an improved search engine. As soon as this feature goes live, it will have an immediate impact on the user experience and the products people are interacting with. It may even change the way people search and the terms they search for. Having real-time statistics such as the number of page views, searches, product visits from searches and product orders from searches gives confidence that the new experience is having a positive impact, or at least not causing any serious problems. This article discusses our experience with building a data store to persist and query click-stream data in real time.

For the backing datastore of our analytics framework there were a huge number of possible solutions. We began by considering SQL, but as we are dealing with millions of click-stream events every day, SQL can quickly become an expensive and not-so-scalable option. Ultimately we made a choice between MongoDB and Elasticsearch. Both are open source, distributed NoSQL document stores – Elasticsearch is built on top of Apache Lucene (a high performance JVM based search engine) and MongoDB is written entirely in C++.

Elasticsearch provides a very clean and compact RESTful API which enables you to use JSON over HTTP to specify your configuration, index documents and perform queries. We found that the query API provided by Elasticsearch offers considerably more flexibility when it comes to faceting and search. An example of this is when you want to calculate a count, total or average and see how it changes over time. For instance, if we want to count the number of searches performed on our site by hour, the date histogram facet makes this easy. The following Elasticsearch query will return a set of key-value pairs, where the key is an hourly timestamp and the value is a count of search events:

{
  "query": {
    "match": {
      "eventType": "search"
    }
  },
  "facets": {
    "siteId": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "hour"
      }
    }
  }
}

Doing this kind of aggregation with MongoDB is of course possible, but we found these queries are simply quicker and easier to write with Elasticsearch. Plus the API makes it very easy to query the store with a REST client. Of course there are advantages to using MongoDB – for instance at the time of writing there is far more documentation available, MongoDB provides more flexibility as the document structure can be changed at any time and also MongoDB is particularly good at facilitating backup and recovery. Since we were not looking for a primary data store, these advantages didn’t outweigh Elasticsearch’s cleaner API.

One of the first decisions to make when using Elasticsearch is how to structure your indices. In our case, we wanted to capture several different types of events, each of which would have different associated attributes. Page visits, searches and basket adds, among other events, are all logged by our web servers in real time. This data then needs to be sent to Elasticsearch and indexed. We decided to use time-based Elasticsearch indices. This means that every day, a new index is created and all data for that day is stored within that index. The index name takes the format ‘EVENT_TYPE-YYYYMMDD’. There are a few advantages to doing this:

  • It’s easy to query over only the events and days we want. Elasticsearch supports wildcards matching for the index name – so if we send a query to ‘searches-201401*’ our query will be executed over all the search events which took place in January.
  • When we want to delete old data, we simply need to delete old indices. Clearing out old data can then be done easily and automatically using a daily cron job (a sketch of such a job appears after this list), with no need to delete individual documents, which would add significant overhead.
  • Routing and sharding can be configured on a per index basis. This makes it possible for us to configure which servers should handle which days. For example, if a server is running out of disk space, we can just configure shards for future indices to avoid that server.
  • Using Elasticsearch templates, we can configure a mapping and settings which are automatically applied to new indices. For example, any index created with a name matching ‘searches-*’ will have a specific mapping and settings applied to it.
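
As an example of how cheap the clean-up in the second point is, dropping a whole day of data is a single DELETE request per index. A minimal sketch of the kind of job a daily cron could run, with the base URL, index prefix and retention period as illustrative assumptions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class OldIndexCleaner {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final DateTimeFormatter INDEX_DATE = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Deletes the daily 'searches' indices that have fallen outside the retention
    // period. Because each day lives in its own index (e.g. searches-20140115),
    // clearing out old data is just one DELETE request per index.
    public static void deleteOldSearchIndices(String elasticsearchUrl, int retentionDays) throws Exception {
        LocalDate cutoff = LocalDate.now().minusDays(retentionDays);
        for (int daysPastCutoff = 1; daysPastCutoff <= 30; daysPastCutoff++) {
            String indexName = "searches-" + cutoff.minusDays(daysPastCutoff).format(INDEX_DATE);
            HttpRequest request = HttpRequest.newBuilder(URI.create(elasticsearchUrl + "/" + indexName))
                    .DELETE()
                    .build();
            HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("DELETE " + indexName + " -> " + response.statusCode());
        }
    }
}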

With a good idea of how the data store would be structured, the next step was to get data from our web logs into Elasticsearch, in real-time. This process involves aggregating log files, and then parsing and indexing the log events into Elasticsearch documents. We developed a Java application to serve this purpose – sending bulk updates to our Elasticsearch cluster every few seconds. The simplified data pipeline is pictured below, although we have since replaced log aggregation with Kafka, a distributed messaging framework.
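
For illustration, a minimal sketch of the indexing step using Elasticsearch’s bulk API is shown below – the index name, document type and header details are illustrative, and the exact bulk request format varies between Elasticsearch versions:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BulkIndexer {

    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Sends a batch of search events to Elasticsearch in a single bulk request.
    // Each document is preceded by an action line naming its target index,
    // following the time-based naming scheme described above (e.g. searches-20140115).
    public static void indexSearchEvents(String elasticsearchUrl, Iterable<String> eventJsonDocs) throws Exception {
        StringBuilder body = new StringBuilder();
        for (String doc : eventJsonDocs) {
            body.append("{\"index\":{\"_index\":\"searches-20140115\",\"_type\":\"search\"}}\n");
            body.append(doc).append("\n");
        }
        HttpRequest request = HttpRequest.newBuilder(URI.create(elasticsearchUrl + "/_bulk"))
                .header("Content-Type", "application/x-ndjson")
                .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
                .build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Bulk response status: " + response.statusCode());
    }
}

In a pipeline like the one described here, the indexing application would buffer events for a few seconds and then flush them to the cluster in one bulk call like this.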

The flow of data from web servers to Elasticsearch

We soon encountered some difficulties when writing queries over our data. One of the biggest difficulties comes when running queries over multiple indices. Each of our event types has different attributes, for example: search terms and search engine information appear on search events, but not on product visit events. What if we want to count the number of product visits resulting from searches for the term ‘xbox’? Well, we want to perform a ‘join’ operation, something which would be quite easy to achieve in SQL:

SELECT * FROM product_visits p INNER JOIN searches s ON (p.userId = s.userId) WHERE p.referrer_url = s.url AND s.term = 'xbox'

With Elasticsearch, things are not so simple. We can easily query several indices at once, but how can we specify that we want to find all product visit documents which have a matching search document?

Solution 1: Use parent-child relationships
Elasticsearch enables us to specify a parent-child relationship between document types. For instance, we could have two types of document: product_interaction documents (which contain event types product_visit and product_order) and cause documents (which contain event types like search). We could then configure a parent-child relationship between causes and product_interactions, such that each product_interaction is the child of a single cause. This structure enables us to use the built-in ‘has_parent’ query to find all product visits that resulted from a specific search:

POST searches-201401*/_search

{
  "query": {
    "has_parent": {
      "parent_type": "cause",
      "query": {
        "match": {
          "searchTerm": "xbox"
        }
      }
    }
  }
}

This solution is flexible. The two document types are completely independent of each other – if we want, we can update one without affecting the other. To aid performance, Elasticsearch ensures that children are physically stored in the same shard as their parent. However, there is still a performance cost to using the parent-child relation and we found that our query speed sometimes suffered. There is also a big limitation to this design – there is no support for more complex relationships, such as grandparent->parent->child. So, we would be unable to add further relationships (e.g. between product_orders and product_interactions) and model the full chain of events.

Solution 2: Add extra information at indexing time
Our eventual solution, which solves both of the above problems, is simply to move the logic surrounding relationships into the application layer. State corresponding to recent user activity (for example, searches in the last ten minutes) is stored in memory in our indexing application. Then, when a product visit event is received, the in-memory state is checked and the ‘cause’ of the product visit can be found. When indexing the product visit document, extra attributes are indexed within the document, such as the search terms which led to the product being visited. If we want to track more events in the chain, such as product orders, then the attributes that were added to the product visit document are copied into the product order document. This means that at query time, we don’t need to search over multiple document types and perform a join – all the information we need is contained in a single document type. This solution gives us very fast query performance at the cost of higher storage and memory requirements, since document size is greater and some information is duplicated.
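
A rough sketch of that enrichment logic, assuming an in-memory map of each user’s most recent search with a ten-minute expiry – the class and field names are purely illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SearchEnricher {

    private static final long TEN_MINUTES_MS = 10 * 60 * 1000L;

    // The most recent search per user, held in memory by the indexing application.
    private final Map<String, RecentSearch> recentSearches = new ConcurrentHashMap<>();

    public void onSearchEvent(String userId, String searchTerm) {
        recentSearches.put(userId, new RecentSearch(searchTerm, System.currentTimeMillis()));
    }

    // When a product visit arrives, look up the user's recent search (if any) and
    // return the search term so it can be copied into the product visit document.
    public String causeOfProductVisit(String userId) {
        RecentSearch search = recentSearches.get(userId);
        if (search == null || System.currentTimeMillis() - search.timestampMs > TEN_MINUTES_MS) {
            return null; // no recent search, so this visit has no search 'cause'
        }
        return search.term;
    }

    private static class RecentSearch {
        final String term;
        final long timestampMs;

        RecentSearch(String term, long timestampMs) {
            this.term = term;
            this.timestampMs = timestampMs;
        }
    }
}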

With this setup, we are able to comfortably query over a month’s worth of data within just a couple of seconds – and that’s with a single-node cluster. One problem we encountered is that it is possible to write incorrect or very computationally demanding queries, and no matter how complex or demanding the query, Elasticsearch will attempt the computation. If it does not have sufficient resources to complete the query, it will fail with an OutOfMemoryError and the only way to recover from this is to restart Elasticsearch on the failed node(s). Aside from this, our experience with Elasticsearch for near real-time analytics has been very positive.