
Mastering Location Data: Geospatial Magic Meets Databricks Power

Ever used Google Maps to find your way around? That’s geospatial data in action! It’s information tied to a place on Earth, like where your favorite ice-cream shop is, where roads go, where cities are expanding, and how places change over time, just to name a few.

GIS, or Geographic Information Systems, takes this data and turns it into smart maps and tools that help people make better decisions. From choosing the safest route for a delivery truck to planning where to build a new hospital or identifying areas at risk of floods or urban heat islands, GIS helps us understand where things happen and how to act on that insight.

Geospatial experts often use tools like FME or ArcGIS to look at maps and analyze location data. They usually keep their data in databases like Postgres or Oracle Spatial, and write code in SQL or Python using libraries like PostGIS, GeoPandas, GDAL, or PDAL to get the job done.

But today, we’re dealing with way more data than before. That’s where platforms like Databricks come in. It’s a modern tool that can handle huge amounts of data, run complex workflows faster, and work alongside the tools geospatial folks already use. Think of it as a powerful new teammate for your geospatial projects.

Where should you begin your journey into geospatial data on Databricks? The good news is that RevoData is offering a specialized training session focused entirely on using Databricks for geospatial workflows. This session will guide you through the essentials of working with Databricks. We’ll also look at how Databricks works together with other geospatial tools like FME, ArcGIS, and Postgres. Whether you’re just getting started or looking to optimize your current processes, this training will help you understand the core principles and practical applications of geospatial data integration within the Databricks ecosystem.

You’ll explore the benefits of migrating your geospatial workflows to Databricks, leveraging its modern lakehouse architecture that merges scalable storage with lightning-fast analytics. We’ll walk through the key Python and Spark libraries that enable efficient and flexible spatial data processing, helping you unlock Databricks’ full potential. By the end of the session, you’ll have a clear understanding of when and how to make the shift, and the tools you’ll need to get there.
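To give a feel for how these libraries work together, here is a minimal sketch (not taken from the training material) that reads vector data with GeoPandas on the driver and hands it to Spark for distributed processing. The file path and view name are placeholders, and it assumes the spark session that Databricks provides in a notebook.

import geopandas as gpd

# Read a vector dataset into a GeoDataFrame on the driver node
gdf = gpd.read_file("/dbfs/data/neighbourhoods.geojson")

# Serialise geometries to WKT text so Spark can carry them as strings
gdf["geometry_wkt"] = gdf.geometry.to_wkt()

# Hand the attribute table (geometry as WKT) to Spark for distributed processing
sdf = spark.createDataFrame(gdf.drop(columns="geometry"))
sdf.createOrReplaceTempView("neighbourhoods")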

This training is designed to be hands-on and practical, with exercises that guide you through real-world applications. We’ll work with a variety of geospatial data types – including vector data (like topographic maps and point clouds), raster data (such as aerial imagery and netCDF files), and even graph-based data – to solve meaningful geospatial problems.

Here’s a quick sneak peek at the hands-on training cases:

Location-allocation

Location-allocation problem: Summer’s almost here, and what better way to celebrate than with a sunny use case? We’ll dive into a geospatial analysis to uncover the top 1,000 sweetest spots in the UK to park an ice cream cart and scoop up the highest profits.

Shortest path between A and B

Shortest path calculation: The shortest path algorithm is one of the most widely used techniques in network analysis, often applied to optimize routes and reduce travel time. In this case, we’ll use it to map out the most efficient paths from a well-known landmark to all other locations within a selected area in the UK, helping us better understand connectivity and accessibility across the region.
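To make the idea concrete, here is a tiny, self-contained illustration of single-source shortest paths using NetworkX; the nodes and edge weights are invented for demonstration and are not the UK road network used in the training.

import networkx as nx

# Build a toy road graph; weights could be segment lengths in kilometres
G = nx.Graph()
G.add_weighted_edges_from([
    ("landmark", "a", 1.2), ("a", "b", 0.8),
    ("landmark", "c", 2.5), ("c", "b", 0.4), ("b", "d", 1.1),
])

# Dijkstra's algorithm from one source to every reachable node
lengths = nx.single_source_dijkstra_path_length(G, "landmark", weight="weight")
print(lengths)  # e.g. {'landmark': 0, 'a': 1.2, 'b': 2.0, 'c': 2.4, 'd': 3.1}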

Change Detection

Temporal change detection using aerial images: This use case compares high-resolution (0.25 meter) aerial orthophotos with RGB and infrared bands from 2022 and 2025 to detect changes in land use, buildings, and vegetation in SoMa, San Francisco. The results support urban planning and development decisions by highlighting growth and transformation in the neighborhood.
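As a simplified illustration of the idea, the sketch below differences one band of two aligned rasters and flags strongly changed pixels; the file paths, the choice of band 4 as near-infrared, and the threshold are assumptions for demonstration only.

import numpy as np
import rasterio

# Open the two epochs; rasters are assumed to be aligned and of equal extent
with rasterio.open("/dbfs/data/ortho_2022.tif") as t0, rasterio.open("/dbfs/data/ortho_2025.tif") as t1:
    nir_2022 = t0.read(4).astype("float32")  # band 4 assumed to be near-infrared
    nir_2025 = t1.read(4).astype("float32")

# Pixel-wise difference and a simple change mask
diff = nir_2025 - nir_2022
changed = np.abs(diff) > 40  # illustrative threshold
print(f"{changed.mean():.1%} of pixels flagged as changed")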

Sky View Factor

Sky view factor analysis: The sky view factor measures how much of the sky is visible from a given point on the ground, which makes it a key indicator in urban heat island and solar exposure studies. This case demonstrates how to derive it at scale within Databricks.

In the upcoming posts, we’ll dive deeper into each use case. Stay tuned!

Photo of Melika Sajadian

Melika Sajadian

Senior Geospatial Consultant at RevoData, sharing with you her knowledge about geospatial analytics on Databricks

SQL Server vs Apache Spark: A Deep Dive into Execution Differences

The way SQL Server and Apache Spark (the backbone of Databricks) process queries is fundamentally different, and understanding these differences is crucial when migrating or optimizing workloads. While SQL Server relies on a single-node, transaction-optimized execution engine, Spark in Databricks is built for distributed, parallel processing.

Execution Model: Single-Node vs. Distributed Processing

SQL Server executes queries within a single-node environment, meaning all operations—such as joins, aggregations, and filtering—occur on a centralized database server. The query optimizer determines the best execution plan, using indexes, statistics, and caching to improve efficiency. However, performance is ultimately limited by the resources (CPU, memory, and disk) of a single machine.

Databricks, powered by Apache Spark, distributes query execution across multiple nodes in a cluster. Instead of a single execution plan operating on one server, Spark breaks down queries into smaller tasks, which are executed in parallel across worker nodes. This approach enables Databricks to handle massive datasets efficiently, leveraging memory and compute resources across a distributed system.

Query Execution Breakdown

  • SQL Server: A query is parsed, optimized into an execution plan, and executed on a single machine. It reads data from disk (or memory if cached), processes it using indexes and statistics, and returns results.
  • Databricks (Spark): A query is parsed and transformed into a Directed Acyclic Graph (DAG), which is then broken down into stages and tasks. The Spark scheduler distributes these tasks across worker nodes, where computations are executed in memory as much as possible before writing results back to storage.
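A quick way to see this difference in practice is to ask Spark for its plan. The hedged snippet below (table and column names are placeholders) shows how a simple aggregation becomes a multi-stage distributed plan with a shuffle (exchange) step:

# Inspect the distributed plan Databricks/Spark builds for a query
df = spark.read.table("sales").groupBy("region").sum("amount")

# The formatted plan shows a scan, a partial aggregation per task, an
# exchange (shuffle) step, and a final aggregation across the cluster
df.explain(mode="formatted")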

Data Shuffling and Joins

One of the biggest differences between the two systems is how they handle joins and aggregations.

  • SQL Server: Since all data is processed on a single machine, joins rely heavily on indexes and sorting. If indexes are missing or inefficient, operations like hash joins or merge joins can cause expensive disk I/O.
  • Databricks (Spark): Joins require shuffling, where data is redistributed across nodes to ensure matching keys are on the same worker. This introduces network overhead but allows for massive scalability. Techniques like broadcast joins (sending a small table to all nodes) help reduce shuffle costs and improve performance.
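As a small illustration of the broadcast technique, here is a hedged PySpark sketch (table and column names are placeholders) in which the small dimension table is shipped to every worker so the large table is never shuffled:

from pyspark.sql.functions import broadcast

fact = spark.read.table("sales_transactions")  # large, distributed table
dim = spark.read.table("store_locations")      # small table that fits in memory

# Broadcast hint: copy the small table to all workers, avoiding a shuffle of the fact table
joined = fact.join(broadcast(dim), on="store_id", how="left")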

Caching and Storage Optimization

SQL Server relies on the buffer pool to cache frequently accessed data in memory, minimizing disk reads. Indexed data is stored efficiently on disk, and execution plans are cached for reuse.

Databricks, on the other hand, benefits from in-memory caching using Spark’s caching feature, reducing repeated reads from cloud storage (e.g., Azure Blob or AWS S3). Additionally, techniques like Z-ordering and partitioning help optimize data layout, reducing scan times for large datasets.
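To illustrate the caching side, a minimal sketch (the table name and filter are placeholders):

# Cache a frequently reused DataFrame to avoid repeated reads from cloud storage
events = spark.read.table("raw_events").filter("event_date >= '2025-01-01'")
events.cache()   # mark the DataFrame for in-memory caching
events.count()   # the first action materialises the cache
events.groupBy("event_type").count().show()  # later queries reuse the cached data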

Fault Tolerance and Scalability

SQL Server operates with ACID transactions and high availability mechanisms like Always On Availability Groups, but it lacks inherent fault tolerance in query execution. If a process fails, it must restart.

Databricks, through Spark, provides fault tolerance via lineage and recomputation. If a node fails, Spark reruns only the affected tasks, ensuring resilience without manual intervention. Additionally, horizontal scalability allows it to scale dynamically based on workload demands.

Do you want to know more?

Are you considering migrating workloads from SQL Server to Databricks? Understanding execution models is key to designing efficient queries and avoiding performance pitfalls. Let’s connect and discuss how to make your transition seamless!

Photo of Rafal Frydrych

Rafal Frydrych

Senior Consultant at RevoData, sharing with you his knowledge in the opinionated series: Migrating from MSBI to Databricks.

Optimizing Performance: SQL Server vs Databricks

Optimization in a Databricks Data Lakehouse differs significantly from traditional SQL Server environments due to its architecture and the nature of data storage. While SQL Server relies on indexing, row-based storage, and dedicated disk structures, Databricks leverages distributed storage, columnar formats, and advanced clustering techniques to enhance performance.

Storage Differences: SQL Server vs. Databricks

SQL Server primarily operates with row-oriented storage, which is optimized for transactional workloads where entire records are frequently accessed. It uses indexes to speed up queries by pre-sorting and structuring data efficiently within a disk-based system. On the other hand, Databricks and other modern Lakehouse platforms use columnar storage formats like Parquet, which enable efficient compression and retrieval for analytical workloads. Instead of fixed disk storage, data in Databricks is often stored in cloud-based solutions such as Azure Blob Storage or AWS S3, leveraging distributed file systems to improve scalability and performance.

Indexing in SQL Server vs. Partitioning in Databricks

In SQL Server, indexing is one of the primary ways to optimize queries, allowing fast lookups within structured tables. However, in Databricks, indexing works differently due to the distributed nature of storage. Instead of relying on indexes, Databricks employs partitioning, which segments large datasets into smaller, manageable chunks based on logical keys like date ranges or categories. While SQL Server indexing is crucial for reducing scan times on relational tables, partitioning in Databricks minimizes the amount of data read, significantly improving query performance.
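As a hedged sketch of what partitioning looks like in practice (the table, column, and catalog names are placeholders):

# Write a Delta table partitioned by a logical key so queries that filter on it
# only read the relevant partitions
(spark.read.table("staging_orders")
    .write
    .format("delta")
    .partitionBy("order_date")
    .mode("overwrite")
    .saveAsTable("analytics.orders"))

# This query can now skip every partition except a single day
spark.sql("SELECT count(*) FROM analytics.orders WHERE order_date = '2025-06-01'").show()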

Advanced Optimizations: Z-Ordering, Liquid Clustering, and Vacuum

Beyond partitioning, Databricks offers additional optimization techniques such as Z-Ordering and Liquid Clustering. Z-Ordering helps co-locate related data within files, reducing the amount of data scanned during queries and enhancing performance for range-based filtering. Liquid Clustering further refines this process by dynamically managing data clustering over time, adjusting to changing query patterns without manual intervention.

Another critical aspect of performance tuning in Databricks is Vacuuming. Unlike SQL Server, where deleted data is managed through transaction logs and page reorganizations, Databricks maintains historical file versions that can accumulate over time. Running Vacuum operations purges obsolete data, ensuring storage efficiency and preventing performance degradation.
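The snippets below are hedged examples of what these commands look like when run from a notebook via spark.sql; the table and column names are placeholders, and 168 hours is simply the common seven-day retention window.

# Z-Ordering: co-locate rows with similar values of the chosen column(s)
spark.sql("OPTIMIZE analytics.orders ZORDER BY (customer_id)")

# Liquid Clustering: declared on the table and maintained automatically over time
spark.sql("""
CREATE TABLE IF NOT EXISTS analytics.orders_clustered (
  order_id BIGINT, customer_id BIGINT, order_date DATE, amount DOUBLE
) CLUSTER BY (customer_id)
""")

# Vacuum: purge data files no longer referenced by the table's recent history
spark.sql("VACUUM analytics.orders RETAIN 168 HOURS")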

Making the Most of Lakehouse Optimization

Optimizing your Data Lakehouse isn’t just about applying best practices—it’s about continuously refining your approach based on your data and workloads. Whether you’re transitioning from SQL Server or looking to enhance your Databricks performance, now is the time to take action.

Are you ready to implement these optimization techniques in your own environment? Start by analyzing your query patterns, revisiting your partitioning strategy, or experimenting with Z-Ordering and Liquid Clustering. If you’re facing challenges, let’s talk! Reach out, share your experiences, and let’s navigate the path to high-performance data together.

Photo of Rafal Frydrych

Rafal Frydrych

Senior Consultant at RevoData, sharing with you his knowledge in the opinionated series: Migrating from MSBI to Databricks.

Orchestration - SQL Server Agent vs. Workflows

One of the pillars of a migration from MSBI to Databricks is orchestration. For years, SQL Server Agent has been the trusted solution for scheduling and automating tasks. It’s simple, well integrated with SQL Server, and has been the backbone of countless ETL jobs, backups, and maintenance routines. But as we look at modern data platforms like Databricks, the question arises: how do Databricks Workflows compare to the familiar SQL Server Agent?

SQL Server Agent: A Reliable Classic with Limits

SQL Server Agent excels in its simplicity. Its GUI-based interface makes it easy to schedule jobs and monitor execution, and its integration with SQL Server ensures a seamless experience for database administrators and BI developers. However, it was built for an era of monolithic systems, and its limitations become apparent in today’s landscape. Scaling beyond SQL Server, working with distributed data, or integrating with cloud-native tools often feels like trying to fit a square peg into a round hole.

Databricks Workflows: Built for Modern Data Needs

Databricks Workflows, on the other hand, are designed for the complexities of modern data engineering. They bring scalability and flexibility to the forefront, enabling you to orchestrate complex pipelines that span Spark jobs, machine learning models, and real-time analytics. Unlike SQL Server Agent, which is tightly tied to SQL Server, Workflows embrace a multi-cloud, multi-tool environment, integrating seamlessly with APIs, cloud services, and third-party platforms.
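To make this a bit more tangible, here is a hedged sketch of a two-task workflow defined through the Databricks Jobs API (2.1). The workspace host, token, notebook paths, and job name are placeholders, and compute configuration is omitted for brevity.

import requests

job_spec = {
    "name": "daily_sales_pipeline",
    "tasks": [
        {"task_key": "ingest",
         "notebook_task": {"notebook_path": "/Repos/pipelines/ingest_sales"}},
        {"task_key": "transform",
         "depends_on": [{"task_key": "ingest"}],  # run only after ingest succeeds
         "notebook_task": {"notebook_path": "/Repos/pipelines/transform_sales"}},
    ],
}

resp = requests.post(
    "https://<workspace-host>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=job_spec,
)
print(resp.json())  # contains the new job_id on success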

The shift to Databricks Workflows also introduces new paradigms, such as event-driven orchestration. Tasks can be triggered by events like file arrivals (Auto Loader) or changes in a database, allowing for real-time automation that SQL Server Agent struggles to achieve. Additionally, Databricks provides advanced monitoring and alerting capabilities, giving you deeper insights into your workflows and the ability to resolve issues quickly.

Making the Transition: Challenges and Opportunities

While the transition might feel daunting at first, it’s essential to focus on the opportunities it brings. The flexibility of Workflows allows teams to start small, using familiar SQL tasks, while gradually exploring more advanced capabilities like PySpark. This approach not only reduces the learning curve but also ensures that your team remains productive during the migration.

Orchestration is more than a technical challenge—it’s a transformation in how we think about automation and scalability. Transitioning from SQL Server Agent to Databricks Workflows requires a shift in mindset, but it’s one that unlocks immense potential for modern data teams.

Join the Conversation

Have you started rethinking your approach to orchestration? What challenges or insights have you encountered? Let’s discuss! And if you’re ready to take the next step, we’re here to help you navigate the transition and make the most of what Databricks has to offer.

Photo of Rafal Frydrych

Rafal Frydrych

Senior Consultant at RevoData, sharing with you his knowledge in the opinionated series: Migrating from MSBI to Databricks.

Quick wins in your Databricks journey: Show value early

The common trap: Starting from the bottom

Many companies approach their Databricks migration by starting at the bottom of the stack: rolling out the platform, re-integrating data sources (often via ODBC/JDBC), and building a bronze layer before modelling and consuming the data. While this method seems logical, it often leaves teams “below the surface” for too long, struggling to demonstrate value as they work through foundational layers.

To avoid this, it’s crucial to rethink how you start. Databricks, for instance, can pull data via JDBC, but its true strength lies in AutoLoader and working with files stored in cost-effective blob storage. Adding change data capture (CDC) capabilities with tools like Debezium can enhance this, but it may also introduce dependencies on platform or infrastructure teams who may not share your timeline or goals.
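As a hedged sketch of the file-based ingestion pattern mentioned above, the snippet below uses Auto Loader to pick up newly arrived Parquet files from blob storage and append them to a bronze Delta table; all paths and the table name are placeholders.

# Incrementally ingest new files from blob storage with Auto Loader (cloudFiles)
landing = "abfss://landing@yourstorage.dfs.core.windows.net/cdc/orders"

stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", landing + "/_schema")
    .load(landing))

(stream.writeStream
    .option("checkpointLocation", landing + "/_checkpoint")
    .trigger(availableNow=True)   # process the current backlog, then stop
    .toTable("bronze.orders"))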

The quickest unlock: Federate into legacy

If your data already resides in a cloud platform like Azure or AWS, the quickest path to success is leveraging native services such as Azure Data Factory (ADF) or AWS Database Migration Service (DMS). These can convert CDC streams into Parquet files, which are easily stored on blob storage. By using these existing tools, you simplify the process, reduce dependencies, and get data into Databricks faster.

When this isn’t an option, or if you really want to go fast, Unity Catalog’s Federation capabilities can provide a workaround. By making your SQL Server databases available in Databricks, you can federate queries directly to the source, enabling you to join live data with datasets already in Databricks. Whether it’s staging databases, data warehouses, or data marts, this approach allows you to build on your existing infrastructure while transitioning to a modern platform.
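A hedged sketch of what that federation setup can look like in Databricks SQL (run here via spark.sql); the connection name, host, credentials, secret scope, database, and table names are placeholders.

# 1. Register the SQL Server instance as a Unity Catalog connection
spark.sql("""
CREATE CONNECTION IF NOT EXISTS sqlserver_dwh TYPE sqlserver
OPTIONS (host 'dwh.example.com', port '1433',
         user 'reader', password secret('dwh-scope', 'reader-password'))
""")

# 2. Expose a database from that server as a foreign catalog
spark.sql("""
CREATE FOREIGN CATALOG IF NOT EXISTS legacy_dwh
USING CONNECTION sqlserver_dwh OPTIONS (database 'DataWarehouse')
""")

# 3. Join live SQL Server data with a table that already lives in Databricks
spark.sql("""
SELECT d.customer_id, d.segment, f.total_spend
FROM legacy_dwh.dbo.dim_customer AS d
JOIN main.gold.customer_spend AS f USING (customer_id)
""").show()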

Show business value from day one

Instead of focusing solely on ingestion pipelines and modelling workflows, prioritise moving consumption use cases to Databricks early. By demonstrating business value—almost from day one—you can gain buy-in from stakeholders and justify further investments in the migration process.

Once the immediate needs are met, gradually shift your data sources from staging into a new ingestion pattern that leverages blob storage and AutoLoader. This step-by-step approach ensures a smoother transition while delivering results that matter to your business.

Ready to take the next step?

At RevoData, we specialize in helping organizations unlock the full potential of Databricks. Whether you’re migrating from SQL Server, optimizing your workflows, or building a modern data platform, our consultants are here to guide you every step of the way. Let us show you how Databricks can transform your data strategy and drive real business impact. Contact RevoData today to get started!

Photo of Rafal Frydrych

Rafal Frydrych

Senior Consultant at RevoData, sharing with you his knowledge in the opinionated series: Migrating from MSBI to Databricks.

From BI to Databricks: Simplifying Architecture Layers

Over the past few weeks, we’ve been exploring the journey from traditional Business Intelligence (BI) to Databricks. As part of this transition, it’s essential to address a key aspect: architecture. While the terminology of Bronze, Silver, and Gold might seem daunting at first, these layers aren’t so different from what you’re already familiar with. Let’s break it down and show how you can adapt this framework to suit your organization.

Layers Are Layers—Let’s Keep It Simple


When it comes to data architectures, we all think in layers. They bring structure and clarity to an otherwise complex ecosystem. So, if you’re transitioning to the medallion architecture with its Bronze, Silver, and Gold layers, don’t let the terminology overwhelm you. We’ve even seen customers add Platinum and Diamond to their layers—why not? If it works for your organization, it works! Remember, a framework is just a starting point; tailor it to fit your needs.

Mapping Staging to the Bronze Layer

The key is to focus on the characteristics of each layer. For example, in the MSBI world, a staging layer is where raw source data lands. It’s still structured around the source, with minimal transformation. The Bronze layer in Databricks serves the same purpose: it’s the raw, unprocessed representation of the source data. Once you see this connection, the transition becomes less intimidating.

Mapping the Data Warehouse to the Silver Layer

The Data Warehouse layer in MSBI aligns closely with the Silver layer in the medallion architecture. In this stage, you introduce organizational standards, naming conventions, and other structures while keeping data at its lowest granularity. This layer is your backbone, designed to remain stable over time.

One key difference in Databricks is the flexibility around traditional data modeling approaches like Kimball or Inmon (star-schema), Anchor modeling, or Data Vault. Here, you can choose how strictly to adhere to these techniques based on your organizational needs. However, it’s critical to ensure this layer is resilient. Changes to data sources or organizational structures should have minimal impact on your models. To achieve this, consider domain-driven design, bounded contexts, and data mesh principles—these sociotechnical concepts help keep your architecture flexible and future-proof.

The Data Mart Layer: Gold (or Platinum, or Diamond)

The final layer—often referred to as the Gold layer in Databricks—is where you optimize data for consumption. Whether it’s a one-big-table design, 3NF, or star-schema, this layer is about delivering business value. Because of its direct impact on the end user, this is where companies tend to allocate the most investment. However, it’s vital not to overlook the upstream layers. A stable foundation is the only way to ensure a reliable and effective Gold layer.

At RevoData, we’ve learned that a logical and user-friendly structure for your Data Catalog is key. Instead of naming catalogs “Bronze,” “Silver,” or “Gold,” we use descriptive labels like “sources,” “domains,” or “data products” and apply the familiar terms as metadata tags. This approach provides a clear path to data for all users while keeping the architecture intuitive and scalable.
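As a hedged illustration of that naming-plus-tagging approach in Unity Catalog (the catalog, schema, and tag names below are examples, not a prescribed standard):

# Descriptive catalog and schema names, with the medallion terms kept as tags
spark.sql("CREATE CATALOG IF NOT EXISTS sources")
spark.sql("CREATE SCHEMA IF NOT EXISTS sources.erp")
spark.sql("ALTER SCHEMA sources.erp SET TAGS ('medallion_layer' = 'bronze')")

spark.sql("CREATE CATALOG IF NOT EXISTS data_products")
spark.sql("ALTER CATALOG data_products SET TAGS ('medallion_layer' = 'gold')")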

Make your Architecture Work for You

Transitioning to Databricks doesn’t mean starting from scratch. By mapping your existing architecture to the medallion framework and customizing it for your organization, you can create a system that’s both familiar and future-ready.

Ready to Take the Next Step?

At RevoData, we specialize in helping organizations make the most of Databricks. Whether you’re starting your journey or looking to refine your approach, we’re here to support you. Let us show you how Databricks can transform your data strategy and deliver real business impact. Reach out to us today to get started!

Photo of Rafal Frydrych

Rafal Frydrych

Senior Consultant at RevoData, sharing with you his knowledge in the opinionated series: Migrating from MSBI to Databricks.

BI Developer - what does migrating to Databricks mean for you?

When transitioning from MSBI to Databricks, the hardest part often isn’t the tools or the technology—it’s the people and their skills. That’s why, even though it’s listed last on our leaflet, we’re tackling this topic first. Let’s talk about what this migration means for your team and how to align their expertise with the Databricks ecosystem.

Expanding horizons - from Dashboards to Analysis and beyond

In the MSBI world, BI developers hold a central role. They’re highly skilled in SQL, possess extensive domain knowledge, and excel at creating dashboards, reports, and even complex cubes using MDX or DAX. Traditionally, this expertise has been closely tied to the classic Data Warehouse (DWH) environment, where structured data models and ETL processes form the backbone of the work. However, in the Databricks landscape, the BI Developer role evolves significantly, adapting to new paradigms and technologies that emphasize scalability, agility, and advanced data analytics.

With Databricks, SQL remains a vital skill, forming a strong foundation for exploring the platform’s capabilities. However, Databricks also introduces the world of Spark, with PySpark emerging as a favored tool among organizations. For BI developers, this shift offers an exciting opportunity to expand their skill set and evolve their role. Rather than a departure from strengths (SQL), this transition represents a chance to adapt and thrive in a rapidly changing environment and to become a more complete data professional.
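To show how directly those SQL instincts carry over, here is an illustrative pair of snippets (table and column names are placeholders) expressing the same aggregation first in familiar SQL and then with the PySpark DataFrame API:

# The SQL a BI developer already knows, runnable as-is on Databricks
sql_result = spark.sql("""
SELECT region, SUM(revenue) AS total_revenue
FROM gold.sales
GROUP BY region
""")

# The same logic with the PySpark DataFrame API
from pyspark.sql import functions as F

df_result = (spark.read.table("gold.sales")
    .groupBy("region")
    .agg(F.sum("revenue").alias("total_revenue")))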

The Data Engineer - why software skills matter

As organizations venture into modern platforms like Databricks, the role of the Data Engineer emerges as critical for unlocking its full potential. To set the stage, it’s important to understand why Databricks excels—it’s a platform designed for flexibility, scalability, and advanced processing. However, it truly shines when operated by individuals with strong software engineering skills, particularly if PySpark is a key component of the data processing strategy.

For teams missing this expertise, our advice is clear: stick to SQL-based workloads in the beginning. This approach minimizes migration risks and ensures your team isn’t overwhelmed by the demands of Spark. After all, you don’t want to leave anyone behind at the station as the data train rolls forward.

The Platform Engineer - bringing infrastructure in-house

In an MSBI environment, platform support often comes from external teams, such as platform, infrastructure, or cloud operations. With Databricks, embedding a Platform Engineer within your team—even temporarily—can make all the difference.

This person ensures your team owns and optimizes the Azure Subscription and/or Resource Group. They help leverage Databricks’ robust security, isolate data storage and workloads, and manage dependencies effectively. Without this role integrated into your team, you risk missing out on these critical capabilities.

Building a future data team

Migrating to Databricks is more than just a technological shift; it’s a transformation of roles, skills, and team dynamics. This change brings challenges but also opportunities to build a robust, future-proof data team.

  1. Leverage existing SQL expertise as the starting point for migration to reduce risk and maintain momentum.
  2. Invest in upskilling your team to embrace new tools and workflows, positioning them for long-term growth.
  3. Embed platform engineering expertise, whether internally or through temporary support, to fully optimize Databricks’ capabilities.

Ultimately, the success of your Databricks implementation hinges on aligning your team’s skills with the platform’s strengths. By empowering your people and providing the right resources, you’ll not only navigate the migration smoothly but also unlock the full potential of a modern, agile data ecosystem. If you’re ready to make the leap, let’s start the journey together—reach out, and we’ll help you chart the course.

Photo of Rafal Frydrych

Rafal Frydrych

Senior Consultant at RevoData, sharing with you his knowledge in the opinionated series: Migrating from MSBI to Databricks.

Still on MSBI? You're not alone

You might be surprised by how many organizations still rely on the full MSBI stack. Despite the rapid shifts in the data landscape, MSBI continues to be a robust, reliable solution that delivers significant value to businesses worldwide. Its enduring presence is a testament to its strength—but is it enough for what lies ahead?

That question is even more interesting when we consider Microsoft’s Azure-based alternatives. Take Azure Data Factory (ADF) for example—does it meet your expectations? Are Synapse and SQL Server Pools delivering the seamless performance and scalability you need? For many, the answer is lukewarm at best.

Then there’s the question of cubes: MDX or DAX? Are you stuck with Multi-Dimensional cubes, or have you transitioned to Tabular cubes via Azure Analysis Services (AAS) or Power BI models? Excel users (sorry, Mac folks!) still find these features useful for self-service analytics. And while Power BI has emerged as a standout in Microsoft’s ecosystem, even it doesn’t solve every challenge posed by modern data demands.

However, there is another option: enter Databricks, a modern data platform that addresses many of the challenges of moving from descriptive to predictive and automated analytics. Just as Obi-Wan Kenobi called the lightsaber an elegant weapon for a more civilized age, Databricks is a modern tool built for current and future data problems.

Why Companies are Looking to Databricks

Faced with these challenges, it’s no surprise that many organizations are turning to Databricks as their next-generation data platform. Databricks offers an open, unified platform that goes beyond traditional BI capabilities, making it possible to integrate advanced analytics, data engineering, and machine learning in one place.

Interestingly, many of our customers choose to retain Power BI as their primary data consumption layer while leveraging Databricks to power their data processing and engineering needs. This hybrid approach offers the best of both worlds: familiar tools for end users and cutting-edge capabilities for data teams.

Join the Conversation

Over the coming weeks, I will share a series of posts designed to help you navigate the shift from MSBI to Databricks. Through an opinionated mental map (attached), I will provide apples-to-pears comparisons, practical advice, and insights into building a future-proof data architecture.

These conversations may spark debate—and that’s a good thing! I invite you to join the dialogue, share your experiences, and explore new perspectives. Stay tuned for the next post in the series. Let’s chart this journey together!

Photo of Rafal Frydrych

Rafal Frydrych

Senior Consultant at RevoData, sharing with you his knowledge in the opinionated series: Migrating from MSBI to Databricks.

Databricks Demystified

You may have come across the term Databricks and wondered what it’s all about. Is it just another buzzword in the world of big data? Or can it genuinely impact your organisation’s data management and analytics capabilities? What follows is a simple introduction to Databricks, explaining what it is, how it works, and how it can be relevant to your organisation and team.

What is Databricks, and what can it do for you?

Databricks is a unified platform for managing and analysing vast amounts of data, combining the power of data engineering, machine learning, and analytics in one place. It offers an array of tools for processing, storing, cleaning, sharing, and analysing data, making it easier for (non-technical) managers to understand and leverage the insights that data can provide. In a nutshell, Databricks helps organisations derive value from their data, guiding decision-making and driving growth.

Making sense of Databricks’ features

Let’s break down some of the key features and functionalities of Databricks:

  1. Data processing and management: Databricks makes it easy to schedule and manage data processing workflows, ingest data from various sources, and discover and explore datasets.
  2. Analytics and visualisation: With tools for working in SQL and generating visualisations and dashboards, Databricks simplifies the process of gleaning insights from your data.
  3. Machine learning: Databricks offers tools for creating and tracking machine learning models, making it easier to incorporate artificial intelligence into your organisation’s operations.
  4. Open-source integrations: As a platform committed to the open-source community, Databricks integrates with popular open-source projects like Apache Spark, Delta Lake, and MLflow.
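For the more technically curious, here is a tiny hedged taste of two of these open-source building blocks, Delta Lake and MLflow, as they are commonly used on Databricks; the table name, parameters, and metric values are placeholders.

import mlflow

# Delta Lake: write a small table with transactional, versioned storage
spark.range(1000).write.format("delta").mode("overwrite").saveAsTable("demo_numbers")

# MLflow: record the parameters and metrics of an experiment run
with mlflow.start_run(run_name="demo"):
    mlflow.log_param("model_type", "baseline")
    mlflow.log_metric("accuracy", 0.91)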

Databricks, AWS and Azure: a perfect match?

Databricks works closely with Amazon Web Services (AWS) and Microsoft Azure to provide seamless integration and optimal performance. Instead of forcing you to migrate your data into proprietary storage systems, Databricks connects with your cloud account and deploys compute clusters using cloud resources that you control. This flexibility ensures that your organisation’s data remains secure and accessible while still benefiting from Databricks’ powerful tools and features.

Real-world applications of Databricks

So, how can Databricks be useful in your organisation? Here are some common use cases:

  1. Building an enterprise data lakehouse: A data lakehouse combines the strengths of data warehouses and data lakes to create a single source of truth for your data.
  2. ETL and data engineering: Databricks simplifies the process of extracting, transforming, and loading (ETL) data, making it easier for your organisation to manage and analyse its data.
  3. Machine learning and AI: Databricks provides tools tailored for data scientists and ML engineers, supporting the development of AI applications that can drive growth and innovation.
  4. Data warehousing, analytics, and BI: it provides a powerful platform for running analytic queries and generating insights that inform your decision-making processes.
  5. Data governance and secure data sharing: Databricks helps you manage permissions and secure access to your data, enabling collaboration both within and outside your organisation.

In summary

In today’s data-driven world, having the right tools and platforms to manage and analyse data is crucial. Databricks is a powerful solution that can help you unlock the full potential of your data, transforming raw information into actionable insights that drive growth and success.

So, next time you hear the term Databricks, you’ll know that it’s not just another buzzword. On the contrary, it’s a powerful platform that can transform the way you harness the power of data. By simplifying data processing, analytics, machine learning, and data governance, Databricks enables you to make better-informed decisions, improve operational efficiency, and drive innovation across your organisation.

So why not explore the potential of Databricks and see how it can help you turn your data into a valuable strategic asset? Contact us for more information, a Proof of Concept (PoC), or a Value Assessment.

How the 9-box method for replenishment can add value for your business

In the retail world, it is essential to have insight into the sales and margin of your products. An effective way to do this is by using the 9-box method for replenishment. This method uses sales and margin data to build a picture of how your products are selling and where there is room for improvement. In this blog post, we take a closer look at how the 9-box method can add value for your business.

What is the 9-box method?

The 9-box method is a tool that lets you visualise the sales and margin of your products in a nine-cell matrix. Each cell in the matrix represents a combination of sales and margin, so you can quickly see which products are performing well and which need improvement.

How does the 9-box method work?

To use the 9-box method, you need data on the sales and margin of your products. These figures are then entered into the matrix, where each cell represents a combination of sales and margin, so you can quickly see which products perform well and which need improvement.
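To make this concrete, here is an illustrative sketch in pandas that assigns products to the nine boxes by scoring sales and margin as low, medium, or high; the figures are made up for demonstration.

import pandas as pd

products = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E", "F"],
    "sales":   [120, 800, 450, 90, 1500, 300],
    "margin":  [0.05, 0.32, 0.18, 0.40, 0.22, 0.10],
})

# Split both measures into three bands, giving 3 x 3 = 9 possible boxes
bands = ["low", "medium", "high"]
products["sales_band"] = pd.qcut(products["sales"], 3, labels=bands)
products["margin_band"] = pd.qcut(products["margin"], 3, labels=bands)
products["box"] = (products["sales_band"].astype(str) + " sales / "
                   + products["margin_band"].astype(str) + " margin")
print(products[["product", "box"]])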

Why is the 9-box method valuable?

The 9-box method is valuable because it is a simple way to gain insight into the performance of your products, allowing you to take targeted action to increase sales and margin. It also highlights the products with the greatest potential and those where quick results can be achieved.

Conclusion

The 9-box method for replenishment is a powerful tool for gaining insight into the performance of your products and taking targeted action. By using sales and margin data, you can quickly see which products are performing well and which need improvement. The 9-box method can therefore add value for your business by enabling more efficient inventory management and higher profit margins.

]]>