--- tags: geopandas, shapely, wine date: "2023-07-11" category: "Geographic Data" jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.14.7 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- +++ {"user_expressions": []} Creating a Map of the wine AOCs of France with geopandas and shapely ================= **Important note: The code displayed in this notebook takes some time to execute (in my computer, about 10 minutes). To render the sphinx file, I had to modify the ```conf.py``` file and add: ```nb_execution_timeout = 1000``` (defaults to 30)**. In this post, I will be creating a map of the wine regions (AOC) of the Loire Valley, from open Data. We will get the data from the open data source, do some simple transformations using shapely, and plot the result. We will also upload the result to Kaggle, so others can use our new data set. ## Step 1: Download the dataset. We will be getting the dataset from the [French Government Open Data portal](https://www.data.gouv.fr/fr/datasets/delimitation-parcellaire-des-aoc-viticoles-de-linao/). We download the ```.zip``` file, unzip it and place the shapefiles in a folder. In my case, here is the path to the shapefile: ```{code-cell} ipython3 path_to_shp = "aoc_map_20230605/2023-07-03_delim-parcellaire-aoc-shp.shp" ``` +++ {"user_expressions": []} Now, we can import our libraries for the project: ```{code-cell} ipython3 import geopandas as gpd import pandas as pd from matplotlib import pyplot as plt ``` +++ {"user_expressions": []} ## Step 2: Creating a GeoPandas GeoDataFrame GeoPandas knows shapefiles, so this as simple as reading the file: ```{code-cell} ipython3 aoc_raw_data = gpd.read_file(path_to_shp) ``` ```{code-cell} ipython3 aoc_raw_data.head(3) ``` +++ {"user_expressions": []} Our DataFrame has many different informations. Since this is from the French administration, all refers to French. Here are some of the relevant columns * ```dt```: *Délégation territoriale* This an administrative department. Not relevant as a wine origin division, but they cover kind of homogeneous areas. We are keeping this for our GeoDataFrame, at least to split it in the beginning and do some explorations. * ```categorie```: wine type(s) * ```signe```: it is always AOC. * ```id_app```: unique identifier for the AOC * ```app```: short for *appellation*, it is the name of the AOC. * ```geometry```: holds the geometry of the viticultural area parcels as Multipolygons. Let's see how many distinct administrative departments (```dt```) we have (*note that I use the ```.size()``` command, this counts the number of rows. The granularity here is: towns x AOC name, so the number is not really relevant nor useful. This command is really just to have an idea on how we can start splitting the DataSet to explore it, because it is big*). ```{code-cell} ipython3 # Get the distinct dt of the dataset dt_distinct = aoc_raw_data.groupby("dt").size().reset_index(name="number_of_dt") dt_distinct ``` ```{code-cell} ipython3 aoc_raw_data.head(2) ``` +++ {"user_expressions": []} ## Step 3: Simplyfy Geometries In order to make the treatment a bit less heavy, I simplify the geometry of the DataFame: * I create a new DataFrame, ```simplified_data```, with the same columns as ```aoc_raw_data```. Then, I iterate over the rows of ```aoc_raw_data``` to add each row, with simplified geometry, to the new DataFrame * I use ```gpd.simplify()```. This is [used to reduce the number of points](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.simplify.html) in the polygon (which obviously reduces the size of it). I set the tolerance to 0.1, because we do not need a lot of precision for our DataSet, we are not going to use it for legal purpose. We are more interested in having something a bit lighter. * I also use ```gpd.buffer()``` * Both ```gpd.simplify()``` and ```gpd.buffer()``` are geoSeries, this is why we itrerate on each row of ````aoc_raw_data``` * The expression ```.to_frame().transpose()``` allows us to make a Series become a DataFrame, and then transpose it to insert it as a row in the ```simplified_data``` DataFrame. ```{code-cell} ipython3 # Set the tolerance value for simplification tolerance = 0.1 # Create a new GeoDataFrame for simplified geometries simplified_data = gpd.GeoDataFrame(columns=aoc_raw_data.columns) # Iterate over each row in the original GeoDataFrame for _, row in aoc_raw_data.iterrows(): # Simplify the geometry of the current row simplified_geometry = row["geometry"].simplify(tolerance) # Buffer the simplified geometry to merge nearby polygons within the same MultiPolygon buffered_geometry = simplified_geometry.buffer(0) # Create a new row with the simplified and buffered geometry simplified_row = row.copy() simplified_row["geometry"] = buffered_geometry # Append the simplified row to the new GeoDataFrame simplified_data = pd.concat( [simplified_data, simplified_row.to_frame().transpose()], ignore_index=True ) # Convert the geometry column to the appropriate geometry type simplified_data["geometry"] = gpd.GeoSeries(simplified_data["geometry"]) ``` ```{code-cell} ipython3 simplified_data.head(2) ``` +++ {"user_expressions": []} ## Step 4: Dissolve the dataframe Disolving the DataFrame takes **A LOT** of time. But we will be happy to have a less heavy file at the end of all this process! The shapely function ```unary_union``` allows us to dissolve the geometries of the GeoDataFrame, grouping them by a certain attribute. To minimize the data and the execution time, we create a minimal DataFrame ```only_aoc``` by only keeping some columns from ```simplified_data```. ```{code-cell} ipython3 from shapely.ops import unary_union ``` ```{code-cell} ipython3 only_aoc = simplified_data[["app", "id_app", "dt", "geometry"]] ``` ```{code-cell} ipython3 # Group the GeoDataFrame by the 'app' attribute grouped_data = only_aoc.groupby(["app", "id_app"]) # Perform unary union within each group merged_geometries = grouped_data["geometry"].apply(unary_union) # Create a new GeoDataFrame with the merged geometries aoc_data = gpd.GeoDataFrame({"geometry": merged_geometries}, crs=only_aoc.crs) # Add the 'app' and 'id_app' attributes from the first row of each group aoc_data["app"] = grouped_data["app"].first().values aoc_data["id_app"] = grouped_data["id_app"].first().values aoc_data["dt"] = grouped_data["dt"].first().values ``` +++ {"user_expressions": []} Next, we reset the index to have all the information in rows. We also set the crs to ```EPSG:2154```. This is the code for Lambert 93 projections, used in general in France (and specified in the shapefiles folders). ```{code-cell} ipython3 # Reset the index aoc_data = aoc_data.reset_index(drop=True) # Set the CRS: aoc_data.crs = "EPSG:2154" ``` +++ {"user_expressions": []} And then we save the data and we make a quick plot to see how it all looks. By the way, by "quick", I mean "not quick at all", the amount of data is important. ```{code-cell} ipython3 aoc_data.head(3) ``` +++ {"user_expressions": []} ## Step 5: Save the DataFrame ```{code-cell} ipython3 aoc_data.to_file('aoc.gpkg', driver='GPKG', layer='aocs_france') ``` ```{code-cell} ipython3 aoc_data.plot() # I do this to see how it looks like, not a good plot -- by the way, it takes long ``` +++ {"user_expressions": []} ## Step 6: Plotting a Region: Loire Valley wines Due to very heavy size of the DataFrame, we are going to only plot a part of it. We will take the "dt" of Angers (which is the west of the Loire Valley) It is important to convert the CRS to 3857, it is one of the supported formats for Open Street Maps. ```{code-cell} ipython3 aoc_data_angers = aoc_data[aoc_data["dt"] == "Angers"] ``` ```{code-cell} ipython3 aoc_data_angers = aoc_data_angers.to_crs("EPSG:3857") # 2154 is unsupported, on ``` ```{code-cell} ipython3 import contextily as ctx import tilemapbase ``` ```{code-cell} ipython3 tilemapbase.start_logging() tilemapbase.init(create=True) extent = tilemapbase.extent_from_frame(aoc_data_angers, buffer=5) ``` ```{code-cell} ipython3 fig, ax = plt.subplots(figsize=(10, 10)) plotter = tilemapbase.Plotter(extent, tilemapbase.tiles.build_OSM(), width=1000) plotter.plot(ax) aoc_data_angers.plot(column="app", ax=ax, legend=False) ``` +++ {"user_expressions": []} And then we save the map. Nottice the ```dpi``` parameter, it means: "dots per inch", I use it to increase the quality of the image ```{code-cell} ipython3 ax.figure.savefig("./wines_loire.png", bbox_inches="tight", dpi=300) ``` +++ {"user_expressions": []} ## Step 7: Making a folium map Folium lets us do interactive maps for a website, an application or a notebook report. it is pretty neeat, but it can be ressource intensive. The last thing we want is a 100 MB webpage, so I will plot the data for the ```dt``` of Pau. The newest versions of GeoPandas let you create a folium map directly with the ```.explore()``` method, [which is AWESOME](https://geopandas.org/en/stable/docs/user_guide/interactive_mapping.html). ```{code-cell} ipython3 aoc_data_pau = aoc_data[aoc_data["dt"] == "Pau"] ``` ```{code-cell} ipython3 import folium #Use chatgpt to get the coordinates of places :) m = folium.Map(location=[43.296482, -0.370027], zoom_start=8) ``` ```{code-cell} ipython3 aoc_data_pau.explore(column="app", m=m, cmap="Dark2") ``` +++ {"user_expressions": []} That's it for today! I will be doing more with this, so bookmark this blog! You can fin the dataset on [this link on kaggle](https://www.kaggle.com/datasets/ericnarro/french-vine-aoc-areas).