The world knows CARTO as the leading platform for geospatial analysis in the modern data stack. But since 2019, the company has built one of the largest repositories of geospatial data products through its Data Observatory. It’s a SaaS company with a substantial data business supporting its growth.
We talk with Javier Pérez Trufero, VP Product & Data at CARTO, about the experience of building data products in a software company, what it means to make software and data products “cloud native,” and how his team is automating fulfillment for its 11,000+ data products using the Bobsled API.
CARTO has invested substantially in a “cloud-native” future. Walk us through the vision.
We started to explore the switch to a fully “cloud-native” strategy and architecture back in 2020. CARTO was originally built as a managed PostGIS service in the cloud with a bunch of different analytical tools and visualization libraries layered on top. This worked great, but it required our users to constantly move their data in and out of our database. That created governance and data management challenges, particularly as our bigger customers migrated more and more workloads to the modern data stack, leveraging Snowflake, Google BigQuery, Databricks and the other cloud data platforms.
At first, we built connectors and integrations into all of these platforms; but we quickly realized that we were fighting “data gravity,” so to speak. So in 2021, we re-platformed CARTO to run entirely on top of our customers’ existing cloud data platforms (e.g. Snowflake, BigQuery, Amazon Redshift, Databricks). The bet has really paid off. We’re seeing huge benefits for our customers. It’s massively more scalable and they do not need to move data out of their core data platform to use CARTO.
CARTO is a SaaS product, but it offers thousands of data products to its customers through the Data Observatory, its own data marketplace. What role do data products play in CARTO’s product strategy?
Our data and software products are deeply intertwined. We’re a software company at our core, but our customers often need external data to maximize the value from our software. They might be interested in, for instance, using CARTO to predict risk rates by location; we can provide the tooling to complete the analysis but it often requires access to demographic, environmental or other data to inform the model.
Early on, we spent a good amount of time working with users to find this data and quickly realized that the [data] industry, in a way, was broken. The time our customers would need to spend to figure out what data was available, identify the right vendors to approach, negotiate terms and then get those data products up and running was not only painfully long; it limited what they could eventually do with our product. That’s why in 2019 we launched the Data Observatory. We started with a few public and premium datasets but now offer access to over 11,000 free and paid data products.
How has the re-platforming impacted the way you bring your data products to market?
When we re-platformed in 2021, we created a new set of challenges with our data products. Before, we managed the database, so “delivery” was never an issue because the products lived in our database. But the cloud-native approach meant that our customers were now using their own data warehouses, running in different regions on different platforms around the world. Suddenly, distribution got a lot more complicated.
Can you share a bit about the delivery experience you are looking to build for your customers?
The end goal here is for our customers to be able to start working with the data they need as quickly as possible, with as little work on their end as possible. Part of the challenge here is not even a technical problem, per se: it has to do with the way data is bought and sold. Since data is priced based on a range of levers – including the customer and their use case – the actual purchase process requires some degree of human involvement.
But there are parts we can control. We’ve always been committed to the idea of automated fulfillment: once a customer has the right to use a data product, they should be able to start using it immediately. That means we needed to build a system to share the data into their data warehouse regardless of platform or region. This is an experience we can build with the right technology, and it’s where we’ve been really excited to work with Bobsled.
How did you go about building automated fulfillment?
As a technology company, we initially started by trying to build this ourselves. Our product and engineering teams spent months speccing out how it might work. We recognized pretty early on that we could not replicate data in every cloud and region; instead, we needed a centralized system to manage the 11,000+ assets from our source in GCP.
After a few months of work, we had two important realizations. First, it’s an extremely hard problem: there’s a huge amount of complexity in working across multiple regions of multiple platforms for multiple customers. Second, this was not our core capability. It’s a gap, but it’s not what we are, or want to be, experts in. We’re experts in how companies can get the most out of their geospatial data using the modern data stack, not cross-cloud data distribution.
Talk a bit about what you’re building with Bobsled.
We’re really excited about what we’ve been able to build with Bobsled. Today, we’re using the Bobsled API to automate fulfillment directly from the internal Retool app that we’ve built to manage our data operations. When a customer signs, our operations team can select the data subscription, pick the destination and then initiate a delivery by clicking a button within the app. Neither our operations team nor our customers ever see Bobsled, but it powers the entire experience.
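To make the workflow concrete, here is a minimal sketch of what triggering a delivery from an internal tool might look like. This is an illustration only: the payload shape, field names and supported-platform list are assumptions for the example, not the actual Bobsled API.

```python
from dataclasses import dataclass

# Hypothetical set of destination platforms, mirroring those named in the article.
SUPPORTED_PLATFORMS = {"snowflake", "bigquery", "redshift", "databricks"}

@dataclass
class DeliveryRequest:
    subscription_id: str  # the data product the customer licensed
    platform: str         # destination warehouse platform
    region: str           # destination cloud region

def build_delivery_payload(req: DeliveryRequest) -> dict:
    """Validate the destination and shape the JSON body that an
    operations app might send to a fulfillment API (names hypothetical)."""
    platform = req.platform.lower()
    if platform not in SUPPORTED_PLATFORMS:
        raise ValueError(f"unsupported destination platform: {req.platform}")
    return {
        "subscription": req.subscription_id,
        "destination": {"platform": platform, "region": req.region},
    }

# Example: an operator picks a subscription and a Snowflake destination.
payload = build_delivery_payload(
    DeliveryRequest("sub-123", "Snowflake", "us-east-1"))
```

In a real integration, the app would POST this payload to the fulfillment service, which handles the cross-cloud sharing from the centralized GCP source.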
One of the most important aspects for us is that Bobsled shares our belief that any solution to this problem has to build on the native sharing protocols being developed by these platforms. Writing directly into a database misses out on the innovation that Snowflake, Databricks and the other platforms are driving to improve the way customers access and share data on their platforms.
When it comes to data products, what part of the user experience is your team focused on improving next?
One area of focus for us is how we can improve the discovery phase for our data products. We now have over 11,000 data products listed in our catalog, and it’s difficult for our users to know which product is best for them. We’re interested in seeing if we can expand our use of Bobsled to help support on-demand access to trial data, which is a great way for users to see the data firsthand.
We’re also exploring how we can use generative AI to help solve the discovery problem. Instead of browsing thousands of listings, a user could simply describe the type of data they are seeking, and we could return the best datasets for them. It’s a great use case, and I think has a lot of potential.