DBSCAN clustering and reverse geocoding for geospatial data

Clustering Visualization Geospatial
Sambhav Singh Rohatgi

Motivation
#

The aim of this post is to look at how to cluster geospatial data using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.

As a use case for this post, we will look at how we can find the most representative data points for each continent, so that we can test a hypothetical ML model for geographical variation in performance.

We will use the SEN12TP dataset for the demonstration here.

What is DBSCAN and why is it useful for geospatial clustering?
#

The DBSCAN algorithm is a popular density-based clustering algorithm used for identifying clusters and outliers in spatial data. It is particularly well-suited for geospatial data because it can handle data with irregularly shaped clusters and varying densities effectively, something that traditional algorithms such as k-means are not able to do.

DBSCAN defines clusters as dense regions of data points separated by regions of lower density. The algorithm works by exploring the density connectivity of the data points. To learn more about how it works, see https://www.analyticsvidhya.com/blog/2020/09/how-dbscan-clustering-works/.

The advantages of DBSCAN for geospatial data include:

  • Robustness to noise: DBSCAN can effectively handle outliers and noise points as they are considered separate from clusters. It does not assign them to any specific cluster and does not force them to belong to a cluster (see the short sketch after this list).
  • Ability to detect clusters of arbitrary shape: DBSCAN can identify clusters with irregular shapes, making it suitable for geospatial data where clusters may have complex spatial distributions.
  • Handling of varying cluster densities: DBSCAN can handle clusters of different densities. It can adapt to regions with high-density clusters as well as regions with low-density clusters.
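
To make the noise handling concrete, here is a minimal sketch (not part of the original pipeline) that runs scikit-learn's DBSCAN on a toy set of 2D points; the isolated point is treated as noise rather than being forced into a cluster.

import numpy as np
from sklearn.cluster import DBSCAN

# two tight groups of points plus one isolated outlier
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],  # cluster A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],  # cluster B
    [20.0, 20.0],                        # outlier, far from everything else
])

labels = DBSCAN(eps=0.5, min_samples=2).fit(points).labels_
print(labels)  # the outlier (last point) is labelled -1 (noise); the two groups get labels 0 and 1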

Our problem statement
#

  • Let’s say we have a hypothetical model for Sentinel-1 to Sentinel-2 image-to-image translation that we want to test on this dataset.

  • In particular, we want to test the model for geographical variance in performance per continent.

  • The dataset contains too many points per continent and we need some way to reduce these to get the most representative points for each continent.

  • We use the DBSCAN algorithm to find the N most representative points per continent.

Downloading the dataset
#

The SEN12TP dataset contains paired sets of imagery from the Sentinel-1 and Sentinel-2 satellites, but we won’t be using the imagery here; we only need the metadata to get geolocations from all over the world.

Download the metadata for the dataset using the following command -

wget https://raw.githubusercontent.com/Sambhav300899/blog_notebooks/main/clustering/dbscan/sen12tp-metadata.json

Loading the dataset
#

We can use geopandas to directly load the dataset.

import geopandas as gpd 

gdf = gpd.read_file("sen12tp-metadata.json")

The dataset is in a different CRS than WGS84; converting it to WGS84 will allow us to work with the locations as latitude and longitude values. The EPSG code for WGS84 is 4326. We can use geopandas to convert our data to the WGS84 projection.
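
If you want to confirm the source projection before reprojecting, geopandas exposes it on the GeoDataFrame (a quick optional check, assuming the metadata file carries CRS information):

# inspect the CRS the file was loaded with
print(gdf.crs)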

Each row of the geopandas dataframe represents one image location. We extract the latitude and longitude of the centroid of each image location.

# convert to crs 4326 to get lat lon
gdf = gdf.to_crs(epsg=4326)
gdf = gdf[["geometry"]]

# get lat lon of centroids
gdf["centroid"] = gdf["geometry"].centroid
gdf["lat"] = gdf["centroid"].y
gdf["lon"] = gdf["centroid"].x

Let’s see what data we have in the dataframe now:

gdf.head()
   geometry                                            centroid                lat        lon
0  POLYGON ((-70.41461164636456 -34.46821038082358,…   POINT (-70.30 -34.55)   -34.5596   -70.3071
1  POLYGON ((72.28825977983655 34.050558629310395,…    POINT (72.399 33.96)    33.9627    72.3993
2  POLYGON ((-8.424361551813695 41.53310978422038,…    POINT (-8.305 41.44)    41.4423    -8.3054
3  POLYGON ((6.724249816765044 46.5562534318746,…      POINT (6.85 46.46)      46.4688    6.85822
4  POLYGON ((-0.554267167211116 42.60345308513507,…    POINT (-0.43 42.51)     42.5108    -0.43611

Plotting all of the locations
#

We will use the cartopy library to plot a map of the earth and the locations of our datapoints on it.

import cartopy.crs as ccrs
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 20))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.stock_img()

ax.plot(gdf["lon"], gdf["lat"], "o", color="r")
plt.title("SEN12TP dataset locations")

Reverse geocoding
#

We want to get N points per continent, so we first need to know which continent each location belongs to. To do this we will use the reverse_geocoder library, together with pycountry_convert to map country codes to continent codes.

import reverse_geocoder as rg
import pycountry_convert as pc

continent_dict = {
    "NA": "North America",
    "SA": "South America",
    "AS": "Asia",
    "AF": "Africa",
    "OC": "Oceania",
    "EU": "Europe",
    "AQ": "Antarctica",
}

gdf["continent_code"] = rg.search(list(map(tuple, gdf[["lat", "lon"]].values)))
gdf["continent_code"] = gdf["continent_code"].apply(
    lambda x: pc.country_alpha2_to_continent_code(x["cc"])
)
gdf["continent_name"] = gdf["continent_code"].apply(lambda x: continent_dict[x])

Let’s plot the points again to confirm that we geocoded correctly.

plt.figure(figsize=(20, 20))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.stock_img()

for continent_name in gdf["continent_name"].unique():
    ax.plot(
        gdf[gdf["continent_name"] == continent_name]["lon"],
        gdf[gdf["continent_name"] == continent_name]["lat"],
        "o",
        label=continent_name,
    )

plt.legend()
plt.title("SEN12TP dataset locations")

Running DBSCAN per continent
#

We will now run the DBSCAN algorithm per continent to get 50 representative points for each continent.

We first define functions to run the DBSCAN algorithm -

import random

import numpy as np
import pandas as pd
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN


def get_centermost_point(cluster):
    # This function returns the point of a cluster that is closest to the cluster centroid

    # We first extract the centroid of the cluster
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)

    # we then calculate the distances between all points and the centroid and take the one with the least distance
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)


def get_n_filtered_pts(gdf_dbscan, max_pts=10, max_dist_km=200):
    # This function runs DBSCAN on a geodataframe and returns at most `max_pts` representative locations for it.
    # `max_dist_km` is the maximum distance that a point can be from the centre of its cluster
    # we will also convert all of our data to radians

    # we first extract the latitudes and longitudes in the form of a numpy array
    coords = gdf_dbscan[["lat", "lon"]].values

    # we then calculate the `epsilon` parameter by converting our `max_dist_km` to radians
    # look at this to understand - https://stackoverflow.com/a/49212829
    kms_per_radian = 6371.0088
    epsilon = max_dist_km / kms_per_radian
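    # worked example: with the default max_dist_km = 200, epsilon = 200 / 6371.0088 ≈ 0.0314 radians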

    # Run DBSCAN
    db = DBSCAN(
        eps=epsilon, min_samples=1, algorithm="ball_tree", metric="haversine"
    ).fit(np.radians(coords))

    # get labels for each point in our dataset
    cluster_labels = db.labels_

    # get the number of clusters by looking at the number of unique clusters
    num_clusters = len(set(cluster_labels))

    # get the different points in each cluster
    clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])

    # get the centermost point in each cluster
    filtered_pts = list(clusters.map(lambda x: get_centermost_point(x)))

    # if we have points more than the required amount, randomly sample
    if len(filtered_pts) > max_pts:
        filtered_pts = random.sample(filtered_pts, max_pts)

    return filtered_pts

We then use the following code to run DBSCAN per continent and create a geodataframe of the filtered data points.

num_pts_per_continent = 50
all_filtered_pts = []

# run per continent
for continent_name in gdf["continent_name"].unique():
    continent_gdf = gdf[gdf["continent_name"] == continent_name]
    all_filtered_pts.extend(get_n_filtered_pts(continent_gdf, num_pts_per_continent))

# get all lat and lons for all continents
lats, lons = zip(*all_filtered_pts)
rep_points = pd.DataFrame({"lon": lons, "lat": lats})

# extract all other information about these points from the original geodataframe
# (matching on exact lat/lon equality is safe here because the filtered coordinates are taken directly from gdf)
filtered_gdf = rep_points.apply(
    lambda row: gdf[(gdf["lat"] == row["lat"]) & (gdf["lon"] == row["lon"])].iloc[0],
    axis=1,
)
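
As an optional sanity check (not in the original post), we can count how many representative points we ended up with per continent; each count should be at most num_pts_per_continent.

# number of representative points kept per continent
print(filtered_gdf.groupby("continent_name").size())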

Let’s visualise this now to see the final datapoints we get.

plt.figure(figsize=(20, 20))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.stock_img()

for continent_name in filtered_gdf["continent_name"].unique():
    ax.plot(
        filtered_gdf[filtered_gdf["continent_name"] == continent_name]["lon"],
        filtered_gdf[filtered_gdf["continent_name"] == continent_name]["lat"],
        "o",
        label=continent_name,
    )

plt.legend()
plt.title("SEN12TP dataset locations")

Conclusion
#

We see that we get a pretty good representation of the geographical spread of our dataset in these points. We can now further use this to test our model’s geographical performance.

Github Notebook
