Toggle It On – nerDigital’s New Feature Flag Solution

This year, nerDigital moved to a dedicated feature flagging solution so that we can deliver features faster and at higher quality, combining input from Beta customers with the results of A/B testing experiments.

In this post, I’ll outline why we made the change, how we did it, and share some of the exciting results.

GOALS & REQUIREMENTS

In the past, when we wanted to turn features on or off in production, or decouple a feature’s release from the deploy process, we used configurations.

While configurations did the job, we decided to examine using a dedicated mechanism for flags. To understand the motivation behind this decision, let’s start by defining the goals we wanted to achieve, and the new requirements they raised.

Goal 1: Improve the Product Lifecycle

We wanted to take advantage of our recent move to daily deploys and increase the effectiveness of our product lifecycle. To achieve this goal, we added two capabilities:

  • The ability to run a monitored, gradual rollout of features so that users receive the highest possible quality
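
A gradual rollout is commonly implemented by hashing each account into a stable bucket and enabling the feature for buckets below the current percentage. The following is a minimal sketch of that idea, not the mechanism of any particular product; the function names, the flag key, and the account IDs are hypothetical.

```python
import hashlib


def rollout_bucket(flag_key: str, account_id: str) -> float:
    """Map an account to a stable value in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{account_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash; two decimal places of resolution.
    return int(digest[:8], 16) % 10_000 / 100.0


def is_enabled(flag_key: str, account_id: str, rollout_percent: float) -> bool:
    """Enable the flag for accounts whose bucket falls below the percentage."""
    return rollout_bucket(flag_key, account_id) < rollout_percent
```

Because the bucket depends only on the flag key and the account ID, an account that sees the feature at 10% keeps seeing it as the percentage grows, which keeps the rollout easy to monitor.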

Goal 2: Create a Unique Experience for Each Tier of Customers

At nerDigital, we manage multiple tiers of customers, from SMBs with a few sites to agencies that manage thousands of sites, with multivariate inner segmentation. We wanted an easy way to create subtle rules in order to control the granularity of the rollout and create a unique experience for each of the different tiers, one which will allow us to both grow and scale.

Goal 3: nerDigital’s Movement Toward Distributed Architecture

We create a new microservice at nerDigital once every few months. For this reason, we wanted a central platform for managing the flags that control various microservices, environments, and instances. Microservices and distributed systems may be polyglot (written in multiple languages) and implemented using different technology stacks, so it was important that, even if we use a different mechanism for service configuration, we have a single flagging platform that allows us to manage the flags in a unified way.

Goal 4: Auditing and Visibility

Over time, we noticed that the number of configurations we have had increased vastly; different configurations had different variations in different environments, and some of them were never used. This led us to the realization that we needed a better way to monitor the lifecycle of our flags and to explain what a flag does and what changes occurred in which environment or segment.

Goal 5: Involving Domain Experts in the Feature Rollout

Previously, when a product manager, account manager, or support agent wanted to change the visibility of a flag for a segment or environment, they had to ask a developer to change it. This made developers the bottleneck in the process, and things became fragile when the complexity of the segmentation increased. To avoid this, we wanted the flags platform to have a comfortable UX that would allow people with no version control access to take relevant actions.

THE IMPLEMENTATION

We set up a group at nerDigital, one with both technical and business-oriented members, and came up with the requirements. In addition to considering in-house solutions, we also evaluated external feature flag mechanisms that could be integrated with nerDigital. We decided to go with a best-in-class third-party solution, so that our in-house resources could continue focusing on what we do best – giving customers an excellent website-building experience.

After making that decision, we found some solutions that answered our business requirements. But to deliver the best user experience, we didn’t want to compromise on the technological aspects, so we mapped out the technical requirements:

  • High Availability: As close as possible to 100%, in order to keep nerDigital’s high SLA.
  • Low Latency: Limited to milliseconds.
  • Scalability: The ability to serve a large number of flags, with a very large number of evaluations (billions of evaluations in live nerDigital sites).

After evaluating and comparing the solutions, we decided to go with LaunchDarkly, a scalable, highly available solution that serves more than 2 trillion feature flags every day.

Since we have multiple microservices, we wanted to have a single, simple way to evaluate flags. We achieved this by creating a simple library that receives some shared context about the customer and evaluates flags against it, so that each customer gets a unique user experience. Since our monolith acts as our “Account service,” it supplies the shared context about the current customer, which is cached on each of our microservices. Each server connects to LaunchDarkly and receives the “flag rules,” and receives the account context from the monolith. With both held in cache, we can evaluate the flag in memory and reduce latency by avoiding access to the network.
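
The local evaluation described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not our library and not the LaunchDarkly SDK: `TTLCache`, `FlagEvaluator`, `fetch_context`, the `tier` attribute, and the rule shape (`enabled_tiers`) are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, dict]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], dict]) -> dict:
        now = self._clock()
        hit = self._items.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh entry: no network call
        value = loader(key)  # miss or expired: ask the account service
        self._items[key] = (now, value)
        return value


class FlagEvaluator:
    """Evaluates flags in memory from cached rules and cached account context."""

    def __init__(self, rules: dict, fetch_context: Callable[[str], dict],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._rules = rules  # hypothetical shape: flag key -> {"enabled_tiers": {...}}
        self._fetch_context = fetch_context  # hypothetical call to the account service
        self._contexts = TTLCache(ttl_seconds, clock)

    def evaluate(self, flag_key: str, account_id: str, default: bool = False) -> bool:
        rule = self._rules.get(flag_key)
        if rule is None:
            return default  # unknown flag: fall back to the caller's default
        context = self._contexts.get_or_load(account_id, self._fetch_context)
        return context.get("tier") in rule["enabled_tiers"]
```

In this sketch, only the first evaluation for an account (or the first one after the cache entry expires) reaches the account service; every other evaluation is a dictionary lookup in the service’s own memory.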

The following diagram shows a simple flow of flag evaluation:

We also preferred not to tie ourselves to third-party availability (though LaunchDarkly has been very reliable so far), so we added an additional caching layer that stores the flag and segment rules. This means that if the connection to LaunchDarkly is down for some reason, we can still serve our customers.
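
One way to picture such a fallback layer is a store that saves the last successfully fetched rules and serves that snapshot when the provider cannot be reached. The sketch below assumes a JSON file as the snapshot and a hypothetical `fetch_remote` callable; the post does not describe where the real layer keeps its data, so treat both as illustrative choices.

```python
import json
from pathlib import Path
from typing import Callable


class RulesStore:
    """Loads flag rules remotely, keeping a last-known-good snapshot on disk."""

    def __init__(self, fetch_remote: Callable[[], dict], snapshot_path: Path):
        self._fetch_remote = fetch_remote  # hypothetical call to the flag provider
        self._snapshot_path = snapshot_path

    def load(self) -> dict:
        try:
            rules = self._fetch_remote()
        except OSError:
            # Provider unreachable (ConnectionError is an OSError):
            # serve the last snapshot if we have one, otherwise re-raise.
            if self._snapshot_path.exists():
                return json.loads(self._snapshot_path.read_text(encoding="utf-8"))
            raise
        # Fetch succeeded: refresh the snapshot for future outages.
        self._snapshot_path.write_text(json.dumps(rules), encoding="utf-8")
        return rules
```

The trade-off is staleness: during an outage, customers are served with the rules as they were at the last successful fetch, which is usually preferable to failing the request.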

The flow of caching the rules engine can be seen in the following diagram:

EMBEDDING THE SOLUTION AT nerDigital

After the solution was implemented, we moved to education and technology adoption.

We held sessions with different departments, explaining the guidelines and tools to developers, showing product managers how to gradually roll out new features, and teaching customer success agents how to manage specific customers.

We also implemented some tools on top of the integration. One of them helps us reduce the number of flags: an automatic mechanism notifies us when flags are ready to be removed from the code (according to the evaluations and state of the flags), in order to avoid cases like Knight Capital Group’s $460 million mistake.
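
As a rough illustration of how such a notification could be decided, the sketch below marks a flag as a removal candidate when it has not been evaluated for a while, or when it has served a single variation without changes for a while. The `FlagStatus` fields and the 30-day quiet period are assumptions made for this example, not the actual criteria our tool uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FlagStatus:
    key: str
    serving_single_variation: bool  # e.g. fully rolled out or fully off
    last_evaluated_at: Optional[datetime]  # None if never evaluated
    last_changed_at: datetime


def removal_candidates(flags: List[FlagStatus], now: datetime,
                       quiet_period: timedelta = timedelta(days=30)) -> List[str]:
    """Return keys of flags that look safe to clean up from the code."""
    candidates = []
    for flag in flags:
        unused = (flag.last_evaluated_at is None
                  or now - flag.last_evaluated_at > quiet_period)
        settled = (flag.serving_single_variation
                   and now - flag.last_changed_at > quiet_period)
        if unused or settled:
            candidates.append(flag.key)
    return candidates
```

A scheduled job could run a check like this and post the resulting keys to the owning team, leaving the actual removal to a developer.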

Today, we are serving 150 flags, which are evaluated thousands of times per second, from over a hundred instrumented instances. We have reached peaks of millions of events per second and everything worked as expected. Our team looks forward to growing and serving more features, having more Beta tests, and continuing to create the best possible user experience for each one of our customers. We hope you are too.
