DATABRICKS: The Glue We Were Missing

Notebooks & the day after

When you throw a fresh dataset at a Python data scientist the first thing he or she will do is spin up a Jupyter notebook and dig in. Notebooks offer you the freedom to run and tweak a block of code until you’re happy with it, add some nicely formatted documentation with a plot and then move on to the next code block. In these first exciting days of a project, data will be cleaned and explored visually, pre-processing code will be written, features engineered, models selected and hyperparameters tuned.

After the buzz of this notebook frenzy, where new things are learned and progress is made, follows a comedown about as enjoyable as filing your taxes: porting the code out of the notebook for production and integrating it with whatever comes before and after it in the pipeline. An R data scientist can face a similar struggle when the code lives in an R Markdown notebook.

Notebooks simply don’t integrate well into larger pipelines, so the longer you postpone the port, the more your technical debt grows. As a result, the only notebook uses that aren’t ephemeral are documentation and reporting at the end of data pipelines.

With the maturation of the Databricks platform, this is about to change. Databricks allows you to take your notebook projects to a whole new level, and in this blog post, I’ll tell you how we did just that in one of our latest experiments.

From Python versus R to a combined toolkit

Over the past decade, both the R and Python data science communities have flourished as high-quality open source tools filled a gap in a rapidly growing data-driven market. Unfortunately, because of functional overlap between some of the most popular libraries, such as pandas and dplyr, the idea has taken hold that the two languages are locked in a struggle for dominance. The creators of these packages, Wes McKinney and Hadley Wickham, respectively, never believed that anything good could come from pitting the two communities against each other and have publicly spoken out against it.

Recently, this message seems to be sinking in at last, and promising initiatives are being set up by these two visionaries to build cross-platform toolkits.

The Databricks platform seems to be one of the most promising early adopters of this changing mindset. Not only can you create both R and Python interactive notebooks, but you can also switch the execution language on a per-cell basis by starting a cell with the %r or %python magic command.
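To make the per-cell switch concrete, here is a minimal sketch of a two-cell notebook, written in the source-file form that Databricks uses when a Python notebook is exported (cells separated by `# COMMAND ----------`, other-language cells prefixed with `# MAGIC`). The toy data, the file name and the idea of handing data over through a CSV file on the driver's local disk are assumptions made for illustration, not a description of how our project passed data between languages.

```python
# Databricks notebook source
# Cell 1 runs in the notebook's default language (Python in this sketch).
import csv

# Hypothetical toy data standing in for a real sales file.
rows = [("2019-01", 120), ("2019-02", 135), ("2019-03", 128)]

# Assumption: both language interpreters run on the same driver node,
# so a file on its local disk is visible to the R cell below.
shared_path = "/tmp/monthly_sales.csv"
with open(shared_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["month", "units"])
    writer.writerows(rows)

# COMMAND ----------

# MAGIC %r
# MAGIC # Cell 2 switches to R for this cell only.
# MAGIC sales <- read.csv("/tmp/monthly_sales.csv")
# MAGIC summary(sales$units)
```

Inside the notebook UI you would simply type `%r` on the first line of the second cell; the `# MAGIC` prefix only appears in the exported file.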

This is precisely what we did for one of our latest projects, which let us pick the best parts of both toolkits for different parts of the analysis. We built a product recommendation engine and a customer segmentation algorithm in Python, and sales forecasts with auto-ARIMA plus ggplot2 graphics in R.


Since Databricks is a managed service within Azure, the setup was done in a matter of minutes. We added an Event Grid listener to pick up data file uploads to Blob Storage, which then triggered an Azure Function that called the Databricks API. This call included details on the type of cluster to use and which packages to load. The customer then received emails containing the advice produced by the different analyses in the Databricks notebook.
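As a rough illustration of what such a call can look like, the sketch below builds (but does not send) a request for a one-off notebook run that specifies a cluster and the packages to install. We assume the Databricks Jobs API `runs/submit` endpoint here; the post does not say which endpoint we actually used. The workspace URL, token, notebook path, cluster size, runtime version, package names and the `input_blob_url` parameter are all hypothetical placeholders.

```python
import json
import urllib.request


def build_run_request(workspace_url, token, notebook_path, blob_url):
    """Build (but do not send) a request for a one-off notebook run."""
    payload = {
        "run_name": "analysis-on-upload",
        "tasks": [
            {
                "task_key": "run_notebook",
                # Type of cluster to use (placeholder values).
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 1,
                },
                # Packages to load on that cluster (example choices).
                "libraries": [
                    {"pypi": {"package": "scikit-learn"}},
                    {"cran": {"package": "forecast"}},
                ],
                "notebook_task": {
                    "notebook_path": notebook_path,
                    # Hypothetical parameter telling the notebook which file arrived.
                    "base_parameters": {"input_blob_url": blob_url},
                },
            }
        ],
    }
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.1/jobs/runs/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example with made-up values; nothing is sent over the network here.
request = build_run_request(
    workspace_url="https://example-workspace.azuredatabricks.net",
    token="dapi-placeholder-token",
    notebook_path="/Shared/customer-analysis",
    blob_url="https://example.blob.core.windows.net/uploads/sales.csv",
)
```

In a real Azure Function you would read the token from a secret store and pass the request to `urllib.request.urlopen`, with error handling around it.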

Databricks natively leverages Spark for big data crunching. While this is one of the strong suits of the platform, we didn’t use this part of the stack for this project as the dataset was sufficiently small to load into memory.


Databricks allows data scientists to bring their analyses to the customer without venturing outside of their notebook's comfort zone. The cross-language feature enables you to use the best of both R and Python worlds. We’re not planning to stop writing code outside of notebooks anytime soon but are definitely happy to have this trick up our sleeves.

Jeroen Boeye, PhD
Head of Machine Learning
About the author

Jeroen is leading the Machine Learning team on a mission to bring value to our customers using data. The team uses Computer Vision, Reinforcement Learning, and Natural Language Processing to reach that goal.
