Using pandas#

Before we get started, we test the learning environment and the most important packages needed to run the notebooks. This is not so much a continuous coherent exercise as individual examples based on the different packages.This exercise is neither an introduction to Python nor extensive tutorials for the individual packages. I advise you, if you have little or no experience with the packages, to work through the relevant tutorial on the websites. All packages offer very good and extensive tutorials. Most of the functions presented here have been taken from these websites.

Learning objectives:
  • Getting to know the learning environment
  • Testing the pandas packages
  • Very brief overview of the function of the package
How to proceed:
  • Testing pandas

Getting started#

Start using pandas. To load the pandas package and start working with it, import the package. The community agreed alias for pandas is pd.

# Load the pandas package
import pandas as pd

Data in Pandas is represented as a table, the so-called DataFrame. A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R. Each column in a DataFrame is a Series.

../../_images/01_table_dataframe.svg

We start by reading data from a csv file into a DataFrame. pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_*.

../../_images/02_io_readwrite.svg

Let’s start and load a automatic weather station file into a pandas dataframe

# Read the data into a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/sauterto/clim_env_hydro/main/docs/nb/data/aws_valley_data_10min.csv", header=1, index_col='TIMESTAMP')

and have a look at the dataframe

# A simple way to plot the DataFrame
df.head()
RECORD BattV PTemp_C AirTC_1 RH_1 AirTC_2 RH_2 SlrW_1 SlrW_2 NR_Wm2 H_Flux WS_ms WindDir WindDir_Std Rain_mm_Tot T107C_1 T107C_2 T107C_3
TIMESTAMP
2014-08-15 17:10:00 0 12.40 5.397 2.228 95.00 2.552 93.40 398.6 59.18 295.5 11.17 1.823 109.00 32.94 0.2 6.817 6.552 7.092
2014-08-15 17:20:00 1 12.41 4.988 2.674 91.50 2.754 91.20 181.2 28.76 98.9 9.59 0.941 13.32 35.72 0.3 6.816 6.541 7.357
2014-08-15 17:30:00 2 12.42 4.665 3.108 83.90 3.260 84.50 664.7 107.00 550.6 35.51 0.510 121.20 80.90 0.1 6.895 6.552 7.485
2014-08-15 17:40:00 3 12.42 4.859 4.057 76.95 4.068 75.96 561.1 92.30 403.7 38.57 2.607 119.50 34.43 0.0 6.992 6.560 7.820
2014-08-15 17:50:00 4 12.43 5.225 4.128 76.08 4.290 75.09 439.7 73.22 288.2 23.81 0.941 297.30 76.48 0.0 7.090 6.531 8.100

We can select a Series from the DataFrame with

# Retrieve the air temperature series from the DataFrame
df['AirTC_1']

do some calculations

# Get the maximum of the air temperature series
df['AirTC_1'].max()

As illustrated by the max() method, you can do things with a DataFrame or Series. pandas provides a lot of functionalities, each of them a method you can apply to a DataFrame or Series. As methods are functions, do not forget to use parentheses ().

You can also get some basic statistics of the data with

df.describe()

The describe() method provides a quick overview of the numerical data in a DataFrame. Textual data is not taken into account by the describe() method.

You can simply select specific columns from a DataFrame with

# That's how you select the AirTC_1 and RH_1 columns from the df DataFrame
df_subset = df[["AirTC_1","RH_1"]]

# Plot the header (first 5 rows)
df_subset.head()

The shape of the DataFrame can be accessed with

# Access the shape attribute. Please note, do not use parentheses for attributes. 
df_subset.shape

Often you need to filter specific rows from the DataFrame, e.g.

../../_images/03_subset_rows.svg

With the following command you can simply select all rows with temperatures above 5ºC

# Select all rows with temerature greather than 5 degrees celsius
T_subset = df_subset[df_subset["AirTC_1"] > 5.0]

# Plot the header rows
T_subset.head()

It is possible to combine multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you can not use or/and but need to use the or operator | and the and operator &. Here is an example

# Select all rows with temerature greather than 5 degrees celsius and a relative humidity above 70%
T_RH_subset = df_subset[(df_subset["AirTC_1"] > 5.0) & (df_subset["RH_1"] > 70.0)]

# Plot the header rows
T_RH_subset.head()

Often you want to create plots from the data.

../../_images/04_plot_overview.svg

To make use of the plotting function you need to load the matplotlib package

# Import matplotlib
import matplotlib.pyplot as plt

You can quickly check the data visually

# Plot the temperature time series
df["AirTC_1"].plot()

# Rotate the x-labels for better readability
plt.xticks(rotation=30);

Or create horizontally stacked plots, add two time series in one plot etc.

# Create two subplots (1 row, 2 columns)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle('Horzontally stacked subplots')

# Plot the temperature time series
df["AirTC_1"].plot(ax=ax1);
# Rotate the x-labels for better readability
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right');

# Plot two temperature time series in one plot
df[["AirTC_2","AirTC_1"]].plot(ax=ax2);
# Rotate the x-labels for better readability
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=45, ha='right');

Here is an example of a box plot

df[["AirTC_1","RH_1","H_Flux"]].plot.box(figsize=(10,5))

And a simple way to plot all variables in a DataFrame

# Create subplots
df[["AirTC_1","RH_1","H_Flux"]].plot(figsize=(15, 5), subplots=True);

# Rotate the x-labels for better readability
plt.xticks(rotation=30);
Note This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns and rows. Apart from the representation, also the data manipulations and calculations you would do in spreadsheet software are supported by pandas.
Reminder
  • Import the package, aka import pandas as pd
  • A table of data is stored as a pandas DataFrame
  • Each column in a DataFrame is a Series
  • You can do things by applying a method to a DataFrame or Series
Homework: Check out the pandas tutorial and get familiar with the syntax.