From hypothesis to impact: How we test and evolve relevancy models
By Sneha Popley
Getting relevancy right requires low latency, auction-level decisioning, and systems that scale reliably.
At Koddi, we collaborate daily with our customers through frequent check-ins on joint experimentation roadmaps. We build and refine custom models that reflect how their businesses really operate. There is no universal model at Koddi: what works for one client may not work for another, and what worked yesterday may not be what we deploy tomorrow.
Customers and context are ever-changing, and so is relevance. Which is to say that Koddi’s models never truly “arrive” at a destination; they are continuously built, tested, and refined.
And no single model can do the job alone. For each client, we design a system of models that make auction-level predictions of things like click and conversion probability. Each plays a distinct role, and they must work together in real time to drive outcomes that matter, including incremental revenue and stronger performance for publishers and advertisers.
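To make that concrete, here is a minimal sketch of how per-model predictions might combine into a single auction-time score. The names (Candidate, bid_cpc, p_click, p_convert) are illustrative assumptions, not Koddi’s production interfaces.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    campaign_id: str
    bid_cpc: float     # advertiser's cost-per-click bid
    p_click: float     # predicted click probability from one model
    p_convert: float   # predicted conversion probability from another model

def relevancy_score(c: Candidate) -> float:
    # Expected value of showing this candidate: the bid weighted by
    # predicted engagement and downstream conversion likelihood.
    return c.bid_cpc * c.p_click * c.p_convert

def rank_auction(candidates: list[Candidate]) -> list[Candidate]:
    # Each model contributes one input, but the auction decision
    # is made from their combined score.
    return sorted(candidates, key=relevancy_score, reverse=True)
```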
In this post, we walk through how we design and evolve our relevancy models with clients, how we introduce and validate new signals over time, and what a mature experimentation cadence looks like in practice.

Creating relevancy models aligned to unique outcomes
Building a relevancy multi-model system may sound complex, but in practice it’s a disciplined, incremental process.
As we create these unique models for customers, we follow a critical five-step process to ensure we optimize relevancy for the right outcomes for each specific network, not just “performance” as it might be defined for a traditional media network.
1. Define the system outcomes and program goals
The first step in working with our customers is defining what outcomes we’re building towards. In working across several different industries, from retail to finance to travel, we know that each industry and each network has unique goals. For an automotive marketplace, identifying and closing a repeat buyer might be the most important metric, whereas a food delivery network might want customers to gain exposure to new products for cross-sell and upsell purposes. The intake process of understanding and rallying behind the right goals for each network is foundational for our work.
2. Begin data ingestion
Data is the backbone of any good model. But getting the data right is harder than most think, and even small mistakes can lead to diminished outcomes. Preparing data for modeling starts with understanding what data you have and whether you can trust it.
That means agreeing on definitions, schemas, missing values, and the behavioral realities behind the data. If a field like “rank” doesn’t match what customers actually see, the models aren’t learning from real user data. Unreliable signals both add unnecessary noise and steer optimization toward unintended outcomes, so getting the data pipeline right from the start is essential.
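As a simple illustration of what “getting the data right” can look like, here is a sketch of the kinds of checks we mean, written against a hypothetical auction log. The column names (rank_served, rank_displayed, clicks, impressions) and thresholds are assumptions for the example.

```python
import pandas as pd

def validate_auction_log(df: pd.DataFrame) -> list[str]:
    issues = []
    # Required fields must exist and be reasonably complete.
    for col in ["rank_served", "rank_displayed", "impressions", "clicks"]:
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif df[col].isna().mean() > 0.05:
            issues.append(f"{col} has more than 5% missing values")
    # Behavioral sanity check: the rank we logged should match what users saw.
    if {"rank_served", "rank_displayed"} <= set(df.columns):
        mismatch = (df["rank_served"] != df["rank_displayed"]).mean()
        if mismatch > 0.01:
            issues.append(f"rank mismatch rate {mismatch:.1%} exceeds 1%")
    # Clicks can never exceed impressions.
    if {"clicks", "impressions"} <= set(df.columns) and (df["clicks"] > df["impressions"]).any():
        issues.append("rows with clicks > impressions")
    return issues
```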
3. Assign values to signals
Every network has troves of rich first-party data. Our next job is to figure out which data (or signals) matters most to predict and achieve the desired output.
In many commerce media systems, the most common feature families include the following (see the sketch after this list):
- Historical meta performance (clicks, impressions, conversions, and the rates derived from them)
- Inventory context (placement, page type, device, time, geography, and other characteristics of the opportunity)
- Audience information (group membership, intent, or propensity signals)
- Interaction features (the intersection of a specific user group and a specific product or campaign)
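Here is the sketch referenced above: the four families represented as a flat feature dictionary for one auction opportunity. Every field name below is illustrative rather than an actual Koddi feature.

```python
def build_features(campaign_stats: dict, request: dict, audience: dict) -> dict:
    return {
        # Historical meta performance
        "ctr_7d": campaign_stats.get("clicks_7d", 0) / max(campaign_stats.get("impressions_7d", 1), 1),
        "cvr_30d": campaign_stats.get("conversions_30d", 0) / max(campaign_stats.get("clicks_30d", 1), 1),
        # Inventory context
        "placement": request.get("placement"),
        "device": request.get("device"),
        "hour_of_day": request.get("hour_of_day"),
        # Audience information
        "audience_segment": audience.get("segment"),
        "intent_score": audience.get("intent_score", 0.0),
        # Interaction feature: a specific segment crossed with this campaign
        "segment_x_campaign_ctr": campaign_stats.get("segment_ctr", {}).get(audience.get("segment"), 0.0),
    }
```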
As models are deployed, the relative value of these signals can change. Historical campaign data was once paramount, but it’s only one indicator of future performance, so we may shift toward more audience information or inventory context. Along the way, we re-align on what success means, which trade-offs are acceptable, and which are off-limits.
4. Signal tuning: test, ship, observe, and adjust
Most relevancy systems are updated infrequently, with changes shipped in large batches. We take a different approach: ongoing experimentation cycles in which the ensemble is continuously monitored, evaluated, and evolved.
We frequently retrain and retune for three main reasons:
- We get smarter and identify better predictors
- Consumer behavior changes, or the market mix shifts
- We get more data, whether new signals the publisher can provide or integrations with other platforms
Additionally, feature values are refreshed continuously. If a feature is “last seven-day CTR,” the value updates as the last seven days roll forward. If a campaign’s performance tanks today, that change is reflected quickly because the underlying data feeding the feature is current. The models are not frozen in time just because the weights were trained earlier.
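As a concrete example of a feature whose value rolls forward on its own, here is a sketch of a “last seven-day CTR” recomputed from current daily stats whenever it is read; the data structure is an assumption for illustration.

```python
from datetime import date, timedelta

def last_7d_ctr(daily_stats: dict[date, tuple[int, int]], today: date) -> float:
    """daily_stats maps day -> (clicks, impressions); recomputed at lookup time."""
    window = [today - timedelta(days=i) for i in range(1, 8)]
    clicks = sum(daily_stats.get(d, (0, 0))[0] for d in window)
    impressions = sum(daily_stats.get(d, (0, 0))[1] for d in window)
    # The trained weights don't change, but this input moves with the data.
    return clicks / impressions if impressions else 0.0
```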
Models also pass through offline evaluation to measure predicted outcomes against observed behavior. This lets us catch regressions before they touch a live marketplace. A feature that looks promising in theory often reveals its edge cases with offline evaluation, and that’s exactly where we want to find them.
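In spirit, offline evaluation looks something like the sketch below: score held-out logs, compare predictions to observed outcomes, and only promote a candidate model that beats the incumbent while staying calibrated. The specific metrics and thresholds here are illustrative assumptions, not our exact gate.

```python
import math

def log_loss(y_true: list[int], y_pred: list[float]) -> float:
    eps = 1e-12
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, y_pred)
    ) / len(y_true)

def calibration_ratio(y_true: list[int], y_pred: list[float]) -> float:
    # Above 1.0 the model over-predicts clicks; below 1.0 it under-predicts.
    return sum(y_pred) / max(sum(y_true), 1)

def passes_offline_gate(y_true, y_pred, incumbent_loss: float) -> bool:
    # Promote only if the candidate beats the incumbent and stays calibrated.
    return (
        log_loss(y_true, y_pred) < incumbent_loss
        and 0.9 <= calibration_ratio(y_true, y_pred) <= 1.1
    )
```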
5. Data enrichment: Continuously send new data to inform features
The fastest way to improve a multi-model system is to improve its inputs. In practice, signals arrive through four paths:
- Passed: Data sent directly in the ad request, such as placement context or audience membership that the publisher already knows at request time.
- Inferred: Signals derived from information already present in the request, such as geography hierarchy or day-parting.
- Learned: Signals that come from observed performance over time. This includes how products perform, how inventory slices behave, and how outcomes trend across conditions. These signals are the core training fuel that allows the system to improve as history accumulates.
- Integrated: Signals retrieved at request time from approved external sources, such as a CDP. This is useful when the publisher has the signal, but not in the ad-serving pathway at the moment of the request. We can fetch what is needed in real time and apply it immediately.
Koddi works directly with publishers to determine which signals belong in which path — and how to get more of the right ones flowing over time.
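To make the four paths concrete, here is a minimal sketch of how a single request’s signals might be resolved. The helper functions (lookup_learned_stats, fetch_from_cdp) and field names are hypothetical stand-ins, not real integrations.

```python
def lookup_learned_stats(placement: str | None) -> float:
    # Stand-in for a feature-store lookup of accumulated performance history.
    return 0.0

def fetch_from_cdp(user_group: str | None) -> float:
    # Stand-in for a real-time call to an approved external source such as a CDP.
    return 0.0

def resolve_signals(ad_request: dict) -> dict:
    signals = {}
    # Passed: sent directly in the ad request.
    signals["placement"] = ad_request.get("placement")
    signals["audience_segment"] = ad_request.get("audience_segment")
    # Inferred: derived from what the request already contains.
    signals["region"] = (ad_request.get("postal_code") or "")[:2]  # geography hierarchy
    signals["daypart"] = "evening" if ad_request.get("hour", 0) >= 17 else "daytime"
    # Learned: looked up from observed performance over time.
    signals["placement_ctr_7d"] = lookup_learned_stats(signals["placement"])
    # Integrated: fetched at request time from an approved external source.
    signals["propensity"] = fetch_from_cdp(ad_request.get("user_group"))
    return signals
```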
How we test without guessing
Testing and iteration are delicate and require serious care. We never want to introduce a new feature or swap a component of the ensemble without strong evidence that it will improve performance. Here’s how we reduce risk before we go live (a simplified readout sketch follows the list):
- Offline evaluation to validate system performance before exposure
- A/B tests with holdouts to measure causal impact
- Guardrails for latency, fill, and spend distribution to protect the user experience
- Clear stop conditions and rollback plans
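The readout sketch mentioned above: a two-proportion z-test on CTR between a holdout and a treatment arm, plus a latency guardrail and explicit stop conditions. The 150 ms threshold and the 1.96 cutoff are illustrative assumptions.

```python
import math

def two_proportion_z(clicks_a: int, imps_a: int, clicks_b: int, imps_b: int) -> float:
    # z-score for the difference in click-through rate between two arms.
    p_a, p_b = clicks_a / imps_a, clicks_b / imps_b
    p_pool = (clicks_a + clicks_b) / (imps_a + imps_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / imps_a + 1 / imps_b))
    return (p_b - p_a) / se if se else 0.0

def experiment_decision(holdout: tuple[int, int], treatment: tuple[int, int], p99_latency_ms: float) -> str:
    if p99_latency_ms > 150:  # guardrail: never trade the request path for lift
        return "rollback"
    z = two_proportion_z(*holdout, *treatment)
    if z > 1.96:              # ~95% confidence the treatment improved CTR
        return "ship"
    if z < -1.96:             # clear stop condition
        return "rollback"
    return "keep testing"
```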
It may seem counterintuitive, but on a mature program a neutral or negative experiment is as valuable as a revenue-positive one; both count as progress. If a feature does not help, we modify how it’s represented, try a few more iterations, and, if it still doesn’t pay off, remove it to improve stability and reduce maintenance costs.
Safeguarding first-party data with privacy & security best practices
Privacy is paramount to every decision we make at Koddi, especially within our models. It shapes which signals we use, how we ingest them, and how we measure results. The overall goal is to improve relevance and outcomes without expanding data exposure. That typically means:
- Favoring contextual signals, aggregated behavioral features, and group-level audiences over anything that could identify an individual
- Keeping every network’s models distinct from one another, never sharing data across networks or using modeling from one to train another
- Working directly with networks on the data they provide, and only considering the data they request to be included
For instance, some networks can pass a precise location; others can only pass regional context. We build and experiment on whatever the network provides.
Scale and latency aren’t buzzwords: They’re critical for success
At enterprise scale, model iteration stops being an optimization exercise and becomes the operating system for the marketplace. More placements, more categories, more geographies, and more advertiser diversity create “hidden segments” where performance behaves differently. If iteration depends on one-off rules or manual tuning, you end up with volatility: uneven outcomes by page type or region, unpredictable advertiser performance, and a constant cycle of firefighting.
Latency is the other half of the equation. We monitor average, p90, p99, and p99.9 latency in milliseconds, per launch and per network, and strive to keep them consistent as the models iterate. Auction-time decisioning has to incorporate the best available signals without slowing the request path or introducing fragile dependencies. That’s why signal strategy matters: what the publisher passes directly, what we can infer, what the models can learn from history, and what we can integrate at request time. Instead of shipping changes without guardrails, we have a disciplined loop for adding, validating, and tuning signals. In other words, we can scale performance without scaling risk.
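On the latency side, the monitoring boils down to something like this sketch: per-launch, per-network request timings summarized at the percentiles we watch. The nearest-rank percentile here is a simplification for illustration.

```python
def percentile(samples_ms: list[float], q: float) -> float:
    # Nearest-rank percentile over a list of request timings in milliseconds.
    s = sorted(samples_ms)
    idx = min(int(round(q * (len(s) - 1))), len(s) - 1)
    return s[idx]

def latency_report(samples_ms: list[float]) -> dict:
    return {
        "avg": sum(samples_ms) / len(samples_ms),
        "p90": percentile(samples_ms, 0.90),
        "p99": percentile(samples_ms, 0.99),
        "p99.9": percentile(samples_ms, 0.999),
    }
```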
Bringing it to life with our customers: Compounding lift through signals + strong relationships
Let’s bring this together with a recent example of how we worked with a customer to get this right. Today, by building the right relevancy models (or Quality Score system), partners have seen growth of:
- +40% CTR
- +20% Revenue
- +70% CVR
Getting to the next level required adding new signals, such as distance, customer history, price range, and more. We spent significant time with the customer identifying which signals would make an impact and how we could cleanly get that data.
The theme is the same: new signals expand what the models can understand, which expands what they can optimize. But we can’t simply add more data and expect better results. Our team had to go step by step, signal by signal, to ensure this would work.
This is what “iterating” really means. It is a portfolio of hypotheses tested with discipline, where wins compound and losses refine the system.
Closing: The models are the loop, and the loop is the partnership
If you take one thing from this article, let it be this: performance does not come from a one-time “model upgrade.” It comes from a shared cadence that makes adding and tuning signals routine. That is how you turn incremental improvements into compounding outcomes.
The models are not the product. The loop is the product. And that loop only works when it’s shared. While results are what matter most, what success looks like isn’t the same for every program. That’s where creative, flexible thinking makes the work worthwhile.
GET IN TOUCH
Ready to get started?
Don’t let your brand get lost in the noise. Partner with Koddi to unlock the power of commerce media and transform the way you engage with your customers. Our team of experts is here to help you navigate complexities and develop a strategy that drives results, no matter your industry, in as little as 45 days.