Data Day Texas: A Recap from our Engineering Team

This past January, our team attended Data Day Texas 2020 in Austin, one of the longest-running big data conferences in the world. This conference gave us a chance to look at some of the latest advancements and changes in the data world and taught us new ways to think about data-centric problems that our industry could face in the future. Working within the hospitality industry, we are constantly challenged to create value out of our data, and after attending Data Day Texas, our engineering team is well equipped to not only address these challenges with cutting edge tools but also inspired to better develop our own innovations.

Working Together as Data Teams

We began the conference by thinking about how data teams are currently organized and how they collaborate with other disciplines internally. The opening keynote, “Working Together as Data Teams” by Jesse Anderson of the Big Data Institute, expanded on this topic greatly. Organizations often watch their data projects struggle, with little understanding as to why. Anderson stated that 85% of data projects fail, for one reason or another, but that this rate could be significantly reduced by developing well-supported data teams. He argued that it is essential to gauge how much value existing data teams are generating from currently running projects, and one of the quickest ways to determine how a data project is performing is to evaluate how much business value it is driving. Business value can be separated into four categories:

  • Good Value: The project is well received and generating value for the company.
  • Bad or Minimal Value: Little or no value is being generated for the company.
  • No Value: The value is bad or stagnant; the project has stalled and no one understands why, or even what the project’s purpose is.
  • Planning: The project is still ramping up toward (hopefully) good value.

To enable a team to reach that “good value”-generating point, the data team must be kept close to the business problem it is trying to solve so that it can keep the end goal in mind. Furthermore, the data team needs support from the C-suite down: because of the nature of its work, the team needs full organizational backing to effectively generate real change. This means that changing who reports to whom within the data organization can affect how it grows. If an organization is large enough, it can also benefit from building a data team with different disciplines and focuses so that the team can address multifaceted problems. Most important of all, though, is enabling the smart people on the team to create independently; sometimes teams need freedom from strict constraints to find solutions. Luckily, Koddi follows several of these principles: we are quick to allow cross-team collaboration on projects, we offer support from the top of our organization down, and we do everything we can to let our smart, creative people develop effective, business-first solutions using all of our available data.

How to Solve Data Pipeline Debt

One of the first presentations we attended turned out to be the most informative about one of Koddi’s prominent (but often overlooked) pain points: data pipeline debt. The presenter, Abe Gong of Superconductive Health, maintains Great Expectations (together with Taylor Miller), an open-source Python library that helps data teams deal with data pipeline debt.

Data pipelines are an essential component of any data science project, but they are often untested, unverified, and undocumented. Just as production code can be volatile, so can the data itself, and unverified assumptions about either can corrupt data quality, drain productivity, erode trust in the data, and lead to accruing pipeline debt. One of the best ways to understand why these issues need a concrete solution is the analogy between data pipelines and software packages. Using software packages is sometimes necessary and often expedites development, but it can add bulk and complexity to a project that could be avoided or minimized. Both package dependencies and data pipelines need guardrails to prevent unnecessary technical debt and keep your projects healthy.

With Great Expectations, expectations about the schema and statistics of datasets are structured in configuration files, which, in turn, can be rendered directly into human-readable documentation. In this workflow, documentation is rendered from tests, and tests are run against new data as it arrives, which means the documentation is guaranteed never to go stale. A very interesting new feature is automated data profiling, which uses raw data to automatically generate a first draft of test suites, allowing users to explore data faster while capturing knowledge for the future. Great Expectations has been on Koddi’s radar for quite some time, but when we first evaluated it, it was missing some important components, such as support for PySpark and flexible test suites. We were very pleased to hear that these improvements are now part of the library, along with several other useful ones. The library has a growing community behind it, and Koddi could become a part of it. Although Great Expectations may need some time to become an industry standard itself, the idea it promotes unequivocally has to become part of the data engineering process.
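To make this concrete, here is a minimal sketch of the classic Great Expectations workflow: declaring expectations against a pandas-backed dataset, collecting them into a suite, and validating a new batch of data against that suite. The file names and columns (bookings.csv, booking_id, and so on) are hypothetical stand-ins, not examples from the talk:

```python
import great_expectations as ge

# Load a CSV as a Great Expectations dataset (a thin wrapper around pandas).
# The file and column names here are hypothetical.
df = ge.read_csv("bookings.csv")

# Declare expectations about the schema and values; each call validates
# immediately and records the expectation for later reuse.
df.expect_column_to_exist("booking_id")
df.expect_column_values_to_not_be_null("booking_id")
df.expect_column_values_to_be_between("nightly_rate", min_value=0, max_value=10000)
df.expect_column_values_to_be_in_set("status", ["confirmed", "cancelled", "pending"])

# Bundle the recorded expectations into a reusable suite...
suite = df.get_expectation_suite()

# ...and validate each new batch of data against it as it arrives.
new_batch = ge.read_csv("bookings_new.csv")
results = new_batch.validate(expectation_suite=suite)
print(results["success"])
```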

Building Your Data Pipeline

Speaking of data pipelines, a common way to quickly bolster a data team’s effectiveness is to garner support from the software engineering organization to build a good pipeline for ingesting and processing the data the team will use. In “Immutable Data Pipelines for fun and profit,” Rob McDaniel discussed some axioms and anti-patterns common to the construction and development of a data ingestion process. One of the key questions in this process is: “Is the data pipeline flexible enough to survive mistakes?” When dealing with data and development, there will always be mistakes and opportunities to improve. These opportunities can be broken into several axioms:

  1. Data never changes. When we deal with data, we want to keep it static and as stateless as possible, removing business logic and conditionals from our data pipeline. This leads to our first anti-pattern: data mutation. Commonly we manipulate data in our data lake, or even earlier, while it still resides in our data warehouse. When data is mutated, the clear path back to its source is lost, complicating rollbacks and obfuscating any auditing process. It’s important to treat data as unchanging and to use a process that captures every addition, update, or delete as new data rather than modifying what already exists (see the sketch after this list).
  2. If you know more now than you did then, you probably know less now than you will. This point extends beyond data pipelines but is especially relevant here. Whenever you design a data pipeline, you need to think about the future, which may involve different data schemas. It’s important not to presuppose the schema and to let the data live untouched for as long as possible in the pipeline. This avoids our second anti-pattern, premature normalization, in which data is mutated too early, causing problems further down the line.
  3. Business logic is a false idol. There is a concrete reason behind this dictum: businesses are organic, imperfect organizations, and their requirements grow and change from year to year. Thus, one must be thoughtful about avoiding our third anti-pattern, fossilization. We want to keep in mind how others might inherit and contribute to our data pipeline, and keeping old, fossilized, unusable logic in a repository endangers that goal.

All of this is to say: design with extensibility, flexibility, and enough space to survive your mistakes. Koddi has continually worked to enhance and harden our data pipeline process so that we receive data in a clean and simple delivery, and so that our developers can keep improving the process as we learn more than we knew when we first designed it.

Machine Learning with MLflow

In the presentation “MLflow: An open platform to simplify the machine learning lifecycle,” Corey Zumar offered an overview of MLflow, a new open-source project from Databricks that simplifies the process of deploying machine learning models. The lifecycle of a data science model is long, and developing the model is only a small part of the overall process. Oftentimes, the most time-consuming part is moving from the proof-of-concept stage to production. It is also much harder to enable other data scientists (or even yourself, one month later) to reproduce a pipeline, compare the results of different versions, track what’s running where, and redeploy or roll back updated models.

Zumar’s presentation was a step-by-step demonstration of using MLflow’s API to make a simple deep learning model production-ready. The API was relatively easy to use, and at the end of this short process it offered a way to track experiment runs between multiple users within a reproducible, cloud-based environment that can manage the deployment of models to production. The problem of simplifying and streamlining the machine learning lifecycle has been gaining attention recently, with Google, Facebook, Uber, and Airbnb creating their own frameworks to deal with it. Where MLflow seems to shine compared to some of these alternatives is that it supports many different training tools, libraries, and environments, and their integration with the API is seamless.
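For a flavor of the API, here is a minimal sketch of MLflow’s tracking workflow: logging a parameter, a metric, and a model artifact for a single run. We substitute a small scikit-learn model and toy data for the deep learning model Zumar demonstrated, so the shape of the code, not the model, is the point:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy dataset standing in for real training data.
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    n_estimators = 100
    mlflow.log_param("n_estimators", n_estimators)  # record the hyperparameter

    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("mse", mse)  # record the evaluation metric for this run

    # Persist the model artifact so it can be redeployed or rolled back later.
    mlflow.sklearn.log_model(model, "model")
```

Every run logged this way shows up in the MLflow tracking UI, which is what makes comparing versions and reproducing results across a team straightforward.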

Cost-Optimized Data Labeling Strategy

Another interesting presentation was “Cost-Optimized Data Labeling Strategy” by Jennifer Prendki of Alectio. One of the conference’s main tracks was Human-in-the-Loop Machine Learning, and this presentation fit squarely within it, as it mainly addressed challenges related to active learning. Active learning is a training design based on the idea that a machine learning algorithm can achieve better accuracy with fewer training labels if it is allowed to choose the data it learns from. Compared to the traditional supervised learning approach, this design is faster and more cost-effective.
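To make the idea concrete, here is a toy sketch of pool-based active learning with uncertainty sampling, one common query strategy. This illustrates the general technique, not Alectio’s framework: the model repeatedly asks for labels only on the points it is least confident about, and the call that "reveals" y stands in for a human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy pool: a small labeled seed set plus a large unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = np.arange(50)
unlabeled = np.arange(50, len(X))

model = LogisticRegression(max_iter=1000)
for _ in range(10):
    model.fit(X[labeled], y[labeled])

    # Uncertainty = closeness to a 50/50 prediction; larger means less sure.
    proba = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = -np.abs(proba - 0.5)

    # "Label" the 20 most uncertain points (revealing y stands in for a human
    # annotator) and move them into the labeled set.
    query = unlabeled[np.argsort(uncertainty)[-20:]]
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)

print("accuracy:", model.score(X, y))
```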

The presentation addressed an assumption frequently made in active learning: that all provided labels are equally correct. Alectio’s framework offers a way of dealing with this issue, in a manner that appeared to combine voting among the different labelers, rating of the labels, and averaging of the results. Along the same lines, there was a discussion of the tradeoffs between the size of a training set and the accuracy of the labeling process, the conclusion being that, in many cases, adding more data (with the noise that comes with it) does not lead to a proportional increase in the model’s accuracy. Although there is no immediate need at Koddi for a framework like the one proposed, we undeniably have a lot of information to analyze, and many of our projects would require annotated labels in order to produce a sophisticated model. In those cases, ideas like the ones presented would be extremely useful.

There was a lot more we at Koddi Engineering picked up at Data Day Texas, but these were the insights that resonated with us most. We are continually striving to find new and innovative ways to create more value from our data. If you’re working on data projects, hopefully these ideas will help you build an agile, effective data team that is well equipped to deliver true business value for your company.

Join Our Team

We’re currently hiring for engineering roles across multiple offices. Check out our Careers page to learn more about life at Koddi and apply today!
