Last year we discovered an extensive dataset on the subject of traffic on German roads provided by the BASt. It holds detailed numbers of cars, trucks and other vehicle groups passing more than 1,500 automatic counting stations. The amazing thing about this dataset is that the records for each counting station are provided on an hourly basis and they reach back to the year 2003. A dataset with aggregated yearly data per station is also available. As this seemed like a pretty exciting dataset to explore, we started to come up with ideas for a visualisation. The project also got accepted by the mFUND program.
As an attempt to get to know the structure and to find a good way for dealing with the massive size of the dataset, we set up some Jupyter (formerly IPython) Notebooks. Notebooks are a great tool for doing data-analysis, so we used these to create simple charts and graphs that answer basic questions and give us insights into the data. In the following sections you can see how we produced helpful charts of a big amount of data with only a few lines of code.
First of all, you’ll need to import some basic libraries. In our case, these are pandas, which provides data-structures, the tools to handle them and I/O utilities to read and write from and to different datasources, and matplotlib, which we will use to create the charts. As we chose not to use a predefined color scheme, we also defined an array of colors for the graphs.
import pandas as pd import matplotlib.pyplot as plt from matplotlib.ticker import FuncFormatter colors = ['#2678B2', '#AFC8E7', '#FD7F28', '#FDBB7D', '#339E34', '#9ADE8D', '#D42A2F', '#FD9898', '#9369BB', '#C5B1D4', '#8B564C', '#C39C95', '#E179C1', '#F6B7D2', '#7F7F7F', '#C7C7C7']
In the following sections, we will produce three different types of charts: A linechart, a scatterplot and a stacked area chart. The linechart and the scatterplot will be based on the hourly data, while the stacked area chart will be used to display the data by year.
We will use a linechart to outline the frequentation of a single counting station during the year 2016. Therefore, we have to load the data from a csv into a
DataFrame object and use the
group_by function to group it by station number. As it would be a lot of data to display on an hourly base, we will sum up the numbers for each day by first grouping them and then summing up the columns for both directions:
# download file from http://www.bast.de/videos/2016_A_S.zip df = pd.read_csv('2016_A_S.txt', sep=';') df = df[['Zst', 'Land', 'KFZ_R1', 'KFZ_R2', 'Datum', 'Stunde']] df.columns = ['nr', 'land', 'cars_r1', 'cars_r2', 'date', 'hour'] # convert values in 'date' column to DateTime df.loc[df.hour == 24, 'hour'] = 0 df['date'] = pd.to_datetime( df['date'].astype(str) + '-' + df['hour'].astype(str), format='%y%m%d-%H') grouped_by_name = df.groupby(['name']) grouped_by_name = df.groupby(['nr']) number = 1177 station = grouped_by_name.get_group(number) station_days = station.set_index('date').groupby([pd.TimeGrouper('D')])[['cars_r1', 'cars_r2']].sum().reset_index()
The code for generating a linechart is quite straight forward. We use subplots to set a size for the chart (
figsize=(15, 8)). In
ax.plot(y, x, ...), one line is plotted to the chart. This is done twice, as traffic should be shown in both directions. The
set_major_formatter function is used for formatting the values of a chart’s axis. In this case, numbers are displayed with ’,’ as thousands separator.
plt.show() will display the chart.
fig, ax = plt.subplots(figsize=(15, 8), dpi=200) ax.plot(station_days['date'], station_days['cars_r1'], c=colors) ax.plot(station_days['date'], station_days['cars_r2'], c=colors) ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ','))) ax.legend(loc='upper right', bbox_to_anchor=(1.1, 1.02)) plt.title('station nr: ' + str(number)) plt.margins(0) plt.show()
Having a look at this chart, it becomes clear that there is a lot more traffic in direction 1 than in direction 2 throughout the year at station 1177. The reason could be that this is the first counting station on the Autobahn A23. The ups and downs in the chart make it clear that there is a less traffic during the weekends.
To show the top 3 busiest days per state, we decided to use a scatterplot. Before the graph can be generated, the dataset needs a bit of restructuring. First, we add a column
sum (sum of the traffic in both directions) to the
DataFrame. Then the data is grouped by both land and day before sorting it by
sum. As only the top three days per state should be displayed,
head(3) is used to in order to get the first five elements.
df_scatter = df df_scatter['sum'] = df_scatter['cars_r1'] + df_scatter['cars_r2'] df_byday = df_scatter.set_index('date').groupby(['land', pd.TimeGrouper('D')])[['sum']].sum().reset_index() df_sorted = df_byday.sort_values(by=['sum'], ascending=False) df_clean = df_sorted.groupby(['land']).head(3)
When it comes to generating the chart, we need the data grouped by
land. For each land all elements can be added to the plot by using
ax.scatter(x, y, ...). Make sure to set a label, otherwise it’s not possible to create a legend:
groups = df_clean.groupby('land') scatter_colors = iter(colors) fig, ax = plt.subplots(figsize=(15, 8), dpi=200) for name, group in groups: dates = group['date'].dt.strftime('%Y-%m-%d').values dates = [pd.to_datetime(d) for d in dates] ax.scatter(dates, group['sum'].values, s=150, c=next(scatter_colors), label=name) ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ','))) ax.legend(loc='upper right', bbox_to_anchor=(1.1, 1.015)) plt.show()
As you can see, the top 3 busiest days in the federal states focus on ~10 days of the year. These peak days can be explained by the start and end of vacations and holidays in the country. For example, lectures in most universities start on October 1st, which explains the high amount of traffic on September 30th.
Last but not least, we took a look at the yearly data, which provides the aggregated numbers per year for each station. We created a stacked area chart to find out if there was a change in the traffic after that in 2005 the ‘LKW-Maut’ was introduced in Germany.
year_data.csv, all elements with value ‘A’ (Autobahnen) in column
str_kl are excluded, as we only want to check if there was more traffic on country roads. The
DataFrame is then grouped by
land, so each land can be displayed as one layer in the chart.
df_stacked = pd.read_csv('year_data.csv', sep=';') df_stacked = df_stacked[df_stacked['str_cl'] != 'A'] grouped_by_land = df_stacked.groupby(['land'])
In our case, three arrays are used to generate the graph:
rows: areas to be stacked
names: labels for areas
These arrays are filled with data by iterating over the
grouped_by_land object. Note that
names are one-dimensional arrays, while
rows is two-dimensional. Finally we plotted the data to the chart using
ax.stackplot(x, y, ...).
x = [2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016] names =  rows =  for name, land in grouped_by_land: by_year = land.groupby(['year'])['trucks_r1', 'trucks_r2'].sum() names.append(name) data = (by_year['trucks_r1'] + by_year['trucks_r2']) data = data.fillna(method='pad').values rows.append(data) fig, ax = plt.subplots(figsize=(10, 5), dpi=200) ax.stackplot(x, rows, labels=names, colors=colors) ax.legend(loc='upper right', bbox_to_anchor=(1.15, 1.02)) ax.yaxis.set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ','))) plt.margins(0, 0.1) plt.show()
The chart shows that the traffic on country roads developed differently than expected and there is no significant rise in the year 2005, when the LKW Maut was introduced.
Wanna have a closer look? Ideas and suggestions for further visualizations are very welcome! Feel free to check out the notebook on GitHub in case you want to play around with the code.