Part II: Challenges in Building a Real-Time Feature Engineering System

Ajay


Building real-time feature engineering systems isn’t as straightforward as it seems. At first glance, it may feel like just another microservice, pulling in data and serving it on demand. But when you scratch beneath the surface, you’ll find a web of challenges — from data inconsistency and feature disparity to maintaining scale and performance.

In this post, we explore the key challenges engineers face before even starting to build a real-time feature engineering system. This isn’t just about implementing another feature store; it’s about understanding the intricacies of data pipelines, misgauged features, and more.

In the last post, we discussed how feature engineering is a critical piece in deploying real-time models. In terms of feature freshness, features can be divided into two types:

  • Batch features are calculated periodically ahead of time and looked up during real-time prediction.
  • Real-time features are calculated at prediction time. (A sketch of how the two are combined at serving time follows below.)
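
To make the distinction concrete, here is a minimal sketch of how the two types meet at prediction time. The feature names are hypothetical, and a plain dict stands in for a real online store such as Redis or Bigtable:

```python
# Sketch: combining batch and real-time features at prediction time.
# "batch_store" is a plain dict here; in practice it would be an online
# store refreshed by a periodic batch job.

batch_store = {
    "user_42": {"txn_count_30d": 17, "avg_amount_30d": 250.0},  # precomputed nightly
}

def build_feature_vector(user_id, current_txn_amount):
    # Batch features: looked up, not computed, at prediction time.
    features = dict(batch_store.get(user_id, {}))
    # Real-time features: computed from the live request itself.
    features["current_amount"] = current_txn_amount
    avg = features.get("avg_amount_30d") or 1.0
    features["amount_vs_30d_avg"] = current_txn_amount / avg
    return features

print(build_feature_vector("user_42", 1200.0))
```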

From an engineering perspective, building systems for real-time features is a hard problem, and no open-source solution solves it completely. In this post we discuss the different challenges faced in building these systems.

Before we deep dive into it, let’s discuss what such an engineering solution would look like:

  • Product engineering teams produce data into messaging queues like Kafka.
  • The feature engineering system ingests this data in real time.
  • The feature engineering system exposes an API to query this data in different ways to build features and serve them in real time. (A minimal sketch of this loop follows the figure below.)
[Figure: typical feature engineering pipeline]
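
As a rough sketch of those three responsibilities: a consumer ingests events and updates feature values, and a small function serves them on demand. The topic name (transactions), the event schema, and the kafka-python client are assumptions here, and a plain dict stands in for a real online store:

```python
# Sketch of the ingest + serve loop, assuming a Kafka topic named
# "transactions" with JSON events like {"user_id": ..., "amount": ...}.
import json
from collections import defaultdict

from kafka import KafkaConsumer  # pip install kafka-python

# In-memory stand-in for an online feature store.
txn_count = defaultdict(int)
txn_sum = defaultdict(float)

def get_features(user_id):
    """The query API: serve current feature values for one entity."""
    n = txn_count[user_id]
    return {"txn_count": n, "avg_amount": txn_sum[user_id] / n if n else 0.0}

def run_ingest():
    consumer = KafkaConsumer(
        "transactions",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for msg in consumer:  # update features as events arrive
        event = msg.value
        txn_count[event["user_id"]] += 1
        txn_sum[event["user_id"]] += event["amount"]

if __name__ == "__main__":
    run_ingest()
```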

At a high level it doesn’t look like a complicated system, but once we have to deploy many high-quality models, it quickly becomes a scaling problem: every model has different feature requirements, which pushes teams to design one of these systems per model, an inefficient use of time and resources.

Now, let’s look at the challenges faced in designing these systems.

1. Misgauged features: Features are calculated incorrectly during model training, which leads to poor-quality models. There are multiple reasons for this:

  • Data scientists mostly get their data sources from analysts. These sources usually have additional logic built on top by data warehouse or business analyst teams (raw layer -> data layer -> presentation layer, etc.). These layers are designed for business use cases, not data science use cases.
  • Even if data scientists use the raw layers, they get third-hand information about the data semantics, because they sit organizationally far from product engineering. In most organizations, their source of truth for data semantics is analysts, who themselves have second-hand information.
  • Sometimes these stale data sources contain precalculated data points that may be impossible to compute in real time. (The sketch below shows how layered sources skew a feature.)
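
As an illustration of how layered sources mislead training (all table and column names here are hypothetical), the "same" feature computed from a presentation-layer table and from the raw event layer can silently disagree:

```python
# Illustration: the presentation layer bakes in business logic the
# data scientist may not know about, so the two definitions diverge.

FEATURE_FROM_PRESENTATION_LAYER = """
SELECT user_id, COUNT(*) AS txn_count_30d
FROM analytics.fct_transactions           -- refunds/test accounts already filtered out
WHERE txn_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user_id
"""

FEATURE_FROM_RAW_LAYER = """
SELECT user_id, COUNT(*) AS txn_count_30d
FROM raw.transaction_events               -- every event, including refunds
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_id
"""
# If the model is trained on the first definition but served from raw
# events, the feature is misgauged before any serving code is written.
```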

2. Feature disparity: The query language used for training features differs from the one used for real-time features: training usually relies on OLAP systems like BigQuery or Snowflake, while real-time serving uses OLTP systems like Bigtable, HBase, or Postgres.

  • Bigger and better models tend to have very complex features, so complex queries are used to generate them in the training phase. Translating those queries into another query language becomes a problem.
  • Most of the time the real-time serving queries are written by people who did not write the training queries, so query mismatches are always possible, causing feature disparity and model performance that diverges from training results. (The sketch below shows how easily the two implementations drift apart.)
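
A minimal sketch of the dual-implementation problem, for a hypothetical feature (a user's 7-day transaction count). The SQL is BigQuery-flavored, and the serving function stands in for code over an OLTP lookup:

```python
from datetime import datetime, timedelta, timezone

# Training side: declarative OLAP SQL.
TRAINING_QUERY = """
SELECT user_id, COUNT(*) AS txn_count_7d
FROM raw.transaction_events
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_id
"""

# Serving side: imperative code over an OLTP row store, usually written
# by a different person than the one who wrote the SQL above.
def txn_count_7d(rows, now=None):
    """rows: one user's transaction rows fetched from the OLTP store."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=7)
    # Subtle bug waiting to happen: ">" here vs ">=" in the SQL above,
    # or naive vs timezone-aware timestamps.
    return sum(1 for r in rows if r["event_time"] >= cutoff)

# The usual defense is a parity test run on sampled users:
# assert offline_value(user) == txn_count_7d(online_rows(user))
```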

3. Non-reusable features: Data scientists across teams create features that other data scientists have already built, but there is no central repository for checking whether a feature already exists. This leads to unnecessary, repetitive feature creation and wasted resources.

  • Even when one team does create a feature in real time and other teams know about it, confidence in reusing it stays low because of the feature’s complexity and uncertainty about its data source.
  • There is always a possibility that the original feature creator changes the feature logic, impacting other data science models that reuse the same feature. (The registry sketch below illustrates one way to make such changes visible.)
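
One commonly proposed mitigation is a central feature registry with ownership and versioning, so a logic change is visible to every consumer. A minimal sketch, with all names illustrative rather than any existing library's API:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureDefinition:
    name: str
    owner: str
    version: int
    source: str          # e.g. the Kafka topic or table it derives from
    description: str
    consumers: list = field(default_factory=list)  # models reusing it

REGISTRY = {}

def register(feature):
    existing = REGISTRY.get(feature.name)
    if existing and existing.version != feature.version:
        # Version bump: consuming teams can be notified before the
        # logic change silently shifts their models' inputs.
        print(f"{feature.name} changed v{existing.version} -> v{feature.version}; "
              f"notify consumers: {existing.consumers}")
    REGISTRY[feature.name] = feature

register(FeatureDefinition("txn_count_7d", "risk-team", 1,
                           "transactions", "7-day transaction count"))
```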

4. Feature complexity: During the training phase we have plenty of compute time to calculate a feature, and the data is static, so using an OLAP solution is a sane and simple choice. In real-time serving both conditions are reversed: we have to calculate hundreds of features in sub-second latency on top of dynamic data.
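
The usual way out is to move the work off the request path: maintain aggregates incrementally as events arrive, so serving becomes a cheap lookup instead of an OLAP-style scan. A minimal sketch of a rolling 7-day count, kept in process memory purely for illustration:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 7 * 24 * 3600
events = defaultdict(deque)  # user_id -> timestamps of recent events

def on_event(user_id, ts):
    """Called from the stream consumer for every incoming event."""
    q = events[user_id]
    q.append(ts)
    cutoff = ts - WINDOW_SECONDS
    while q and q[0] < cutoff:  # evict expired events as we go
        q.popleft()

def txn_count_7d(user_id):
    """Serving path: a cheap in-memory lookup instead of a table scan."""
    return len(events[user_id])
```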

5. Dynamic data: When our data lives in messaging queues:

  • Many different teams are involved in producing this data, and any production issue in those producers hits our feature engineering pipeline.
  • If the whole pipeline is impacted, we can still take action, such as pausing predictions. The situation is more dire when one particular category of data is dropped because of a production issue or a feature rollout, for example certain transaction types disappearing from the overall transaction Kafka stream.
  • Monitoring these data pipelines with minimal alert fatigue is crucial for proper utilization of resources. (A per-category monitoring sketch follows below.)
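
One low-fatigue approach is to alert per category against a smoothed baseline, so a single dropped transaction type raises one targeted alert instead of general noise. A minimal sketch, with thresholds chosen arbitrarily for illustration:

```python
from collections import defaultdict

baseline = defaultdict(float)   # category -> smoothed events per window
ALPHA, DROP_RATIO = 0.2, 0.5    # smoothing factor, alert threshold

def check_window(counts):
    """counts: events per category in the last window (e.g. 5 minutes)."""
    alerts = []
    for category, expected in baseline.items():
        observed = counts.get(category, 0)
        if expected > 10 and observed < DROP_RATIO * expected:
            alerts.append(f"{category}: {observed} vs baseline {expected:.0f}")
    for category, observed in counts.items():  # update the baseline
        baseline[category] = ALPHA * observed + (1 - ALPHA) * baseline[category]
    return alerts

baseline.update({"card": 1000.0, "upi": 800.0})
print(check_window({"card": 950, "upi": 120}))  # alerts only for "upi"
```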

6. Feature monitoring: Even with data monitoring in place, we still have to make sure features are not shifting because of production issues or business changes, as discussed earlier. For example, if a new and better transaction type is introduced and replaces existing transactions, existing features, and hence existing models, are impacted.
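
A simple starting point is to snapshot each feature's statistics at training time and compare the live distribution against that snapshot. A minimal sketch, with hypothetical numbers and a plain z-score test standing in for more robust drift measures such as PSI:

```python
import statistics

TRAINING_STATS = {  # captured once, when the model was trained
    "txn_count_7d": {"mean": 12.4, "stdev": 3.0},
}

def drift_alert(feature_name, live_values, z_bound=3.0):
    ref = TRAINING_STATS[feature_name]
    live_mean = statistics.mean(live_values)
    z = abs(live_mean - ref["mean"]) / ref["stdev"]
    if z > z_bound:
        return f"{feature_name}: live mean {live_mean:.1f} vs training mean {ref['mean']}"
    return None

# A new transaction type replacing an old one shows up here as a shift
# in the feature's mean long before model metrics visibly degrade.
print(drift_alert("txn_count_7d", [2, 3, 1, 2, 4]))
```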

Conclusion

Creating a real-time feature engineering solution is more than just plugging in data pipelines and APIs. It requires addressing these foundational problems to ensure that the models you serve are as robust and accurate as they can be. In the upcoming posts, we’ll discuss potential solutions and how we can approach this problem from a scalable and sustainable engineering perspective.
