FROM DATA PIPELINES TO INTELLIGENCE PIPELINES: BRIDGING DATA ENGINEERING AND DATA SCIENCE
Keywords:
data engineering, data science, machine learning pipelines, ETL, feature store, data drift, training-serving skew, feature engineering, ML in production, data quality
Abstract
Most machine learning projects fail before they ever reach users — not because the models are bad, but because the data pipelines feeding them are unreliable. This article looks at the gap between data engineering and data science, and why closing that gap is the single most important thing an organization can do to make its ML investments pay off. It walks through the three most common pipeline failure modes — ETL breakdowns, data latency, and feature drift — and explains what a well-built intelligence pipeline looks like in practice. The goal is simple: help teams build systems where good data and good models work together, not against each other.