Using pandas
Contents
Using pandas#
Before we get started, we test the learning environment and the most important packages needed to run the notebooks. This is not so much a continuous coherent exercise as individual examples based on the different packages.This exercise is neither an introduction to Python nor extensive tutorials for the individual packages. I advise you, if you have little or no experience with the packages, to work through the relevant tutorial on the websites. All packages offer very good and extensive tutorials. Most of the functions presented here have been taken from these websites.
- Getting to know the learning environment
- Testing the pandas packages
- Very brief overview of the function of the package
- Testing pandas
Getting started#
Start using pandas. To load the pandas package and start working with it, import the package. The community agreed alias for pandas is pd.
# Load the pandas package
import pandas as pd
Data in Pandas is represented as a table, the so-called DataFrame. A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R. Each column in a DataFrame is a Series.
We start by reading data from a csv file into a DataFrame. pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_*.
Let’s start and load a automatic weather station file into a pandas dataframe
# Read the data into a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/sauterto/clim_env_hydro/main/docs/nb/data/aws_valley_data_10min.csv", header=1, index_col='TIMESTAMP')
and have a look at the dataframe
# A simple way to plot the DataFrame
df.head()
RECORD | BattV | PTemp_C | AirTC_1 | RH_1 | AirTC_2 | RH_2 | SlrW_1 | SlrW_2 | NR_Wm2 | H_Flux | WS_ms | WindDir | WindDir_Std | Rain_mm_Tot | T107C_1 | T107C_2 | T107C_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TIMESTAMP | ||||||||||||||||||
2014-08-15 17:10:00 | 0 | 12.40 | 5.397 | 2.228 | 95.00 | 2.552 | 93.40 | 398.6 | 59.18 | 295.5 | 11.17 | 1.823 | 109.00 | 32.94 | 0.2 | 6.817 | 6.552 | 7.092 |
2014-08-15 17:20:00 | 1 | 12.41 | 4.988 | 2.674 | 91.50 | 2.754 | 91.20 | 181.2 | 28.76 | 98.9 | 9.59 | 0.941 | 13.32 | 35.72 | 0.3 | 6.816 | 6.541 | 7.357 |
2014-08-15 17:30:00 | 2 | 12.42 | 4.665 | 3.108 | 83.90 | 3.260 | 84.50 | 664.7 | 107.00 | 550.6 | 35.51 | 0.510 | 121.20 | 80.90 | 0.1 | 6.895 | 6.552 | 7.485 |
2014-08-15 17:40:00 | 3 | 12.42 | 4.859 | 4.057 | 76.95 | 4.068 | 75.96 | 561.1 | 92.30 | 403.7 | 38.57 | 2.607 | 119.50 | 34.43 | 0.0 | 6.992 | 6.560 | 7.820 |
2014-08-15 17:50:00 | 4 | 12.43 | 5.225 | 4.128 | 76.08 | 4.290 | 75.09 | 439.7 | 73.22 | 288.2 | 23.81 | 0.941 | 297.30 | 76.48 | 0.0 | 7.090 | 6.531 | 8.100 |
We can select a Series from the DataFrame with
# Retrieve the air temperature series from the DataFrame
df['AirTC_1']
do some calculations
# Get the maximum of the air temperature series
df['AirTC_1'].max()
As illustrated by the max() method, you can do things with a DataFrame or Series. pandas provides a lot of functionalities, each of them a method you can apply to a DataFrame or Series. As methods are functions, do not forget to use parentheses ().
You can also get some basic statistics of the data with
df.describe()
The describe() method provides a quick overview of the numerical data in a DataFrame. Textual data is not taken into account by the describe() method.
You can simply select specific columns from a DataFrame with
# That's how you select the AirTC_1 and RH_1 columns from the df DataFrame
df_subset = df[["AirTC_1","RH_1"]]
# Plot the header (first 5 rows)
df_subset.head()
The shape of the DataFrame can be accessed with
# Access the shape attribute. Please note, do not use parentheses for attributes.
df_subset.shape
Often you need to filter specific rows from the DataFrame, e.g.
With the following command you can simply select all rows with temperatures above 5ºC
# Select all rows with temerature greather than 5 degrees celsius
T_subset = df_subset[df_subset["AirTC_1"] > 5.0]
# Plot the header rows
T_subset.head()
It is possible to combine multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you can not use or/and but need to use the or operator | and the and operator &. Here is an example
# Select all rows with temerature greather than 5 degrees celsius and a relative humidity above 70%
T_RH_subset = df_subset[(df_subset["AirTC_1"] > 5.0) & (df_subset["RH_1"] > 70.0)]
# Plot the header rows
T_RH_subset.head()
Often you want to create plots from the data.
To make use of the plotting function you need to load the matplotlib package
# Import matplotlib
import matplotlib.pyplot as plt
You can quickly check the data visually
# Plot the temperature time series
df["AirTC_1"].plot()
# Rotate the x-labels for better readability
plt.xticks(rotation=30);
Or create horizontally stacked plots, add two time series in one plot etc.
# Create two subplots (1 row, 2 columns)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle('Horzontally stacked subplots')
# Plot the temperature time series
df["AirTC_1"].plot(ax=ax1);
# Rotate the x-labels for better readability
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right');
# Plot two temperature time series in one plot
df[["AirTC_2","AirTC_1"]].plot(ax=ax2);
# Rotate the x-labels for better readability
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=45, ha='right');
Here is an example of a box plot
df[["AirTC_1","RH_1","H_Flux"]].plot.box(figsize=(10,5))
And a simple way to plot all variables in a DataFrame
# Create subplots
df[["AirTC_1","RH_1","H_Flux"]].plot(figsize=(15, 5), subplots=True);
# Rotate the x-labels for better readability
plt.xticks(rotation=30);
- Import the package, aka import pandas as pd
- A table of data is stored as a pandas DataFrame
- Each column in a DataFrame is a Series
- You can do things by applying a method to a DataFrame or Series