Temperatures & Visualization
What’s the best way to guess the temperature at the top of a mountain?
Climate.
Intro and Foundations
Today we are going to examine multiple dataframes using sqlite3. These dataframes include NOAA station metadata, FIPS_10-4 & ISO 3166 ID codes for countries, and NOAA station temperature data.
☀1
First off, in order to execute our plan, we need to import our initial models.
import pandas as pd
import sqlite3
from plotly import express as px
from plotly.io import write_html
import numpy as np
from sklearn.linear_model import LinearRegression
import plotly.graph_objects as go
from plotly.io import write_html
Now that our modules are imported, lets import our csvs, and use SQL to combine them into a SQL database!
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)
temps = pd.read_csv("temps_stacked.csv")
countries = pd.read_csv('countries.csv')
# sql doesn't like spaces, so we edit the column names to get rid of it
countries = countries.rename(columns= {"FIPS 10-4": "FIPS_10-4"})
We can take a look at list(countries.columns) + list(temps.columns) + list(stations.columns)
to see the available columns.
[‘FIPS_10-4’, ‘ISO 3166’, ‘Name’, ‘ID’, ‘Year’, ‘Month’, ‘Temp’, ‘ID’, ‘LATITUDE’, ‘LONGITUDE’, ‘STNELEV’, ‘NAME’]
Now we can create our database using sql.
#open a connection to temps.db so that you can 'talk' to it using python
conn = sqlite3.connect("temps.db")
temps.to_sql("temperatures", conn, if_exists="replace", index=False)
countries.to_sql("countries", conn, if_exists="replace", index=False)
stations.to_sql("stations", conn, if_exists="replace", index=False)
# always close your connection
conn.close()
☀2
Now that we have created our database, lets try creating a query function called “query_climate_database()” that will obtain temperatures for a range of years on a specific month, for any country selected.
def query_climate_database(country, year_begin, year_end, month):
#obtain 2 letter country ID
countree = countries[countries['Name']== str(country)].iloc[0]['FIPS_10-4']
#connect to the database
conn = sqlite3.connect("temps.db")
cmd_text_1 = f'''
SELECT S.name, S.latitude, S.longitude, SUBSTRING(S.id, 1, 2) Country, T.year, T.month, T.temp
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
WHERE (SUBSTRING(S.id,1,2) = '{countree}') AND (T.year BETWEEN {year_begin} AND {year_end}) AND (T.month == {month})
'''
cmd = cmd_text_1
#write the cmd into a df
df = pd.read_sql_query(cmd, conn)
conn.close() #close our connection
return df #output our df
Lets Implement our Function!
☀3
Using our function, now lets take this function and apply some graphical functions on top of it! Lets create a function called temperature_coefficient_plot that will create a scatter plot on top of a global map image that shows the average yearly temperature change for a selected month in a selected country.
#this function will use LinearRegression to find the coefficient of Year x Temp and return the value
def coef(dfs):
X = dfs[["Year"]]
y = dfs["Temp"]
LR = LinearRegression()
LR.fit(X, y)
slope = LR.coef_[0]
return slope
#this function takes our df, converts numbers and adds required columns to our final df
def change_in_temp_creator(df):
#make the column; groupbyname -> isolate "Temp" colum and find length... add length to
#the column
df['obs'] = df.groupby(['NAME'])["Temp"].transform(len)
#rounding out the long/lat
df[["LATITUDE", "LONGITUDE"]] = df[["LATITUDE", "LONGITUDE"]].round(2)
#mask the 'obs' and get rid of all rows w/ obs < 10
df = df[df['obs'] >= 10]
#apply our coef function with the selected dataframe
coefs = df.groupby(["NAME", "Month","LATITUDE", "LONGITUDE"]).apply(coef)
coefs = coefs.reset_index() #resets our index
#change column name to be more descriptive
coefs.rename(columns = {0: "Est. yearly increase Celcius"}, inplace = True)
coefs["Est. yearly increase Celcius"] = coefs["Est. yearly increase Celcius"].round(4) #round our values
return (coefs) #df
#this is our core function to output our eventual plot
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
#init df
df = query_climate_database(country, year_begin, year_end, month)
df1 = change_in_temp_creator(df)
month_name = months[str(month)]
fig = px.scatter_mapbox(df1,
lat = "LATITUDE", # column name for latitude informataion
lon = "LONGITUDE", # column name for longitude information
opacity = 1,
hover_name = "NAME",
hover_data= ["Est. yearly increase Celcius"],
color = "Est. yearly increase Celcius",
range_color=[-0.1,0.1],
**kwargs)
fig.update_layout(
title={
'text': f"Est. in yearly temp. increase in {month_name} for stations in {country}, years {year_begin} - {year_end}",
'y':0.96,
'x':0.52,
'xanchor': 'center',
'yanchor': 'top'})
return fig
color_map = px.colors.sequential.thermal #set a color map
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 3,
mapbox_style="carto-positron",
color_continuous_scale = color_map)
fig.show()
From the plot, we can see the distribution of the stations in India, in addition to their increase coefficient on the map. Additionally, we can input new parameters and reuse our function!
color_map = px.colors.sequential.thermal # choose a colormap
fig = temperature_coefficient_plot("China", 1975, 2020, 12,
min_obs = 20,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale = color_map)
fig.show()
Lets explore our Database some more!
☀4a)
Lets say we want more detailed descriptive statistics on the temperatures collected from our global stations. Using a boxplot, we can easily obtain and visualize such data.
Lets create a function that will take in a specific country, a time range, and additional graphing parameters and that outputs a boxplot accordingly.
#this is an SQL query function that will output our desired df based on the inputs
def boxplot_stations(country, year_begin, year_end):
#obtain 2 letter country ID
countree = countries[countries['Name']== str(country)].iloc[0]['FIPS_10-4']
#connect to the database
conn = sqlite3.connect("temps.db")
cmd_text_1 = f'''
SELECT S.name, SUBSTRING(S.id, 1, 2) Country, T.year, T.month, T.temp
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
WHERE (SUBSTRING(S.id,1,2) = '{countree}') AND (T.year BETWEEN {year_begin} AND {year_end})
'''
cmd = cmd_text_1
#write the cmd into a df
df = pd.read_sql_query(cmd, conn)
conn.close() #close our connection
return df #output our df
#contains the core code to plot the dataframe
def boxplot_creator(country, year_begin, year_end, **kwargs):
#init df
df = boxplot_stations(country, year_begin, year_end)
fig = px.box(df,
y = "Temp",
animation_frame = "Year",
notched = True,
width = 700,
height = 700,
color = "Year",
title = f"Boxplot of Temps in {country} over {year_begin} - {year_end}",
**kwargs)
return fig
color = px.colors.sequential.Plasma #creating our colormap
fig = boxplot_creator("Russia", 1980, 2000,
color_discrete_sequence = color)
fig.show()
Looking at our plot, we can now get an time series of the descriptive statistics in a specific Country.
☀4b)
Additionally, lets now create another function to look at the distribution of stations and their elevations of a specific country during a certain year.
Lets create a function that will take in a country, year, and additional graphing parameters and outputs the correct mapbox accordingly.
def obtain_stnelevs(country, year_begin):
countree = countries[countries['Name']== str(country)].iloc[0]['FIPS_10-4']
#connect to the database
conn = sqlite3.connect("temps.db")
cmd_text_1 = f'''
SELECT S.STNELEV as Elevation, S.name, S.latitude, S.longitude, SUBSTRING(S.id, 1, 2) Country, T.year
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
WHERE (SUBSTRING(S.id,1,2) = '{countree}') AND (T.year = {year_begin})
'''
cmd = cmd_text_1
#write the cmd into a df
df = pd.read_sql_query(cmd, conn)
conn.close() #close our connection
return df #output our df
def stnelev_plot(country, year_begin, **kwargs):
#init df
df = obtain_stnelevs(country, year_begin)
#clean up the df
#because indexes where station name are the same contain the same
#lat/long/elevation, we should clean all duplicates so we have unique stations
df1 = df.drop_duplicates().reset_index().drop(["index"], axis = 1)
fig = px.scatter_mapbox(df1,
lat = "LATITUDE", # column name for latitude informataion
lon = "LONGITUDE", # column name for longitude information
opacity = 0.5,
hover_name = "NAME",
hover_data= ["Elevation"],
color = "Elevation",
range_color = [0, 5000],
**kwargs)
return fig
color_map = px.colors.diverging.Spectral
fig = stnelev_plot("United States", 2020, mapbox_style= 'carto-darkmatter',
color_continuous_scale = color_map, zoom = 3, )
fig.show()
Look at the diagram, we can now examine any country in the stations database at any available year to understand the distribution and elevations of the stations.
Thanks for reading! See you next time.