Python for Coaches Part 4: Wrangling your Training Peaks Data with Pandas

Alan Couzens, M.Sc. (Sports Science)

pandaslide.gif

In my last post on Python for Coaches, we went through how to pull training data from Strava's API (application programming interface). While APIs provide an ideal route to gather data, especially streaming data that changes over time, not all platforms offer one. Training Peaks, for one, does not offer a public API. They used to, but elected to discontinue it (not that I'm bitter about that or anything :-) Never mind, there are other ways to access your data from Training Peaks. We briefly looked at one of those - pulling data from non-API URLs - in my last post. This week, we'll look at another: pulling data from csv files.

Like the other posts in the series, you can follow along via an interactive notebook here - just make a copy and upload your own Training Peaks csv files to analyze your own data.

Many websites that collect data will offer a csv export of that data. Training Peaks, for one, offers a really handy csv export of your data...

If you click on your profile, you'll see there's an export option in the menu that provides the ability to export raw files (see my first post on how to read and analyze these) but also a csv export of your workout and your metrics data.

TP_export.png

Simply enter 2 dates in the boxes for either one, click 'export' and you will find a nice csv export of your workout data or your metrics data (resting heart rate, sleep etc.) in the downloads folder on your PC.

A csv file, or comma-separated-value file, is just an easy-to-read text file of data that is, predictably enough, separated by commas :-) You can easily open a csv file with the notepad app on your machine or, more conveniently, with any spreadsheet program like Excel. If you use Excel to open your metrics file, here's what you'll see...

TP_metrics.png
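(If you opened that same file in a plain text editor instead, the first few lines would look something like this - just the Timestamp, Type and Value columns that we'll load with pandas below:)

Timestamp,Type,Value
2020-01-01 09:12:00,Sleep Qualilty,5
2020-01-01 09:12:00,Sleep Hours,8.00
2020-01-01 09:12:00,Stress,4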

A treasure trove of wellness data! Including sleep metrics, heart rate variability, stress, injury and overall feeling. If you open the workouts csv, you'll see a similar wealth of mineable data!

TP_workouts.png

Via its csv exports, Training Peaks offers us a ton of data from the workouts in an easily accessible format! - the workout title, time, planned duration, actual duration, date, coach comments, distance, average power... Oh heck, it's going to take me ages to list out every data field - let's do it the easier way! Now would be a great time to introduce Pandas!

Pandas is a Python library that makes pulling, analyzing, cleaning and processing data super easy. You've probably heard the expression "garbage in, garbage out". When it comes to working with and modeling data, nothing could be more true. It's essential that before we begin the process of analyzing and building models around data, we make sure that the data is good! I.e. that the data is in a format that we can process and that there are no weird errors or outliers that might pollute things and lead to erroneous conclusions. This is a very important part of data science and Pandas makes it as painless as possible.

Let's introduce pandas by using it to view all of the available fields in our workouts csv...

In [123]:
import pandas as pd

df = pd.read_csv('workouts.csv')
list(df.columns)
Out[123]:
['Title',
 'WorkoutType',
 'WorkoutDescription',
 'PlannedDuration',
 'PlannedDistanceInMeters',
 'WorkoutDay',
 'CoachComments',
 'DistanceInMeters',
 'PowerAverage',
 'PowerMax',
 'Energy',
 'AthleteComments',
 'TimeTotalInHours',
 'VelocityAverage',
 'VelocityMax',
 'CadenceAverage',
 'CadenceMax',
 'HeartRateAverage',
 'HeartRateMax',
 'TorqueAverage',
 'TorqueMax',
 'IF',
 'TSS',
 'HRZone1Minutes',
 'HRZone2Minutes',
 'HRZone3Minutes',
 'HRZone4Minutes',
 'HRZone5Minutes',
 'HRZone6Minutes',
 'HRZone7Minutes',
 'HRZone8Minutes',
 'HRZone9Minutes',
 'HRZone10Minutes',
 'PWRZone1Minutes',
 'PWRZone2Minutes',
 'PWRZone3Minutes',
 'PWRZone4Minutes',
 'PWRZone5Minutes',
 'PWRZone6Minutes',
 'PWRZone7Minutes',
 'PWRZone8Minutes',
 'PWRZone9Minutes',
 'PWRZone10Minutes',
 'Rpe',
 'Feeling']

In the above code we import the pandas library, calling it pd (because we're lazy and we don't want to have to type "pandas" every time we call a function :-). Then we create a dataframe from our spreadsheet - which we simply call "df" (a common convention when working with pandas) - by calling pandas' "read_csv" function.

You can think of a dataframe as pandas' version of a spreadsheet, i.e. it's a 2-dimensional set of data with columns across the top - whose names we accessed with the df.columns call above - and rows down the side of the page. Another convenient way to think of a pandas dataframe is as a python dictionary. You'll remember from the last post that a dictionary is a named list denoted by curly brackets {} with key:value combinations for each field, e.g....

In [124]:
my_dictionary = {'first_name':'Alan', 'last_name':'Couzens', 'age':'45'}

One thing that you can do with a dictionary that you'll often see with json formatted data is to combine a dictionary with lists, e.g. if I was to add "favorite songs" to my dictionary above (of course I can't pick just one :-) I could create a list of my favorite songs within my dictionary by doing something like...

In [125]:
my_dictionary['favorite_songs'] = ['Round Here - Counting Crows', 'Let Down - Radiohead']

print(my_dictionary)
{'first_name': 'Alan', 'last_name': 'Couzens', 'age': '45', 'favorite_songs': ['Round Here - Counting Crows', 'Let Down - Radiohead']}

You can see my dictionary now also has a list within it (denoted by the square brackets []). I bring this up because your dataframe is essentially the same structure: it is a dictionary of column names, and each column name has a list of all of the values in that column underneath it. E.g. if we want to pull a list of all of the workout titles, we can reference the 'Title' column of our dataframe in the same way that we would any dictionary...

In [126]:
df['Title']
Out[126]:
0                           Easy Bike: Easy Spin (Flat)
1                           Easy Run: Easy T-Run (Flat)
2                         Easy Swim: Length 400s (Pool)
3              Aerobic Run: Neg Split/Poker Pace (Flat)
4                                           Zwift - NYC
                             ...                       
892                                    Zwift - Richmond
893                         Easy Swim: Feel 100s (Pool)
894                                 Zwift Run - Watopia
895    Recovery Mobility: Yoga (Standing) ( Home/Class)
896                         Easy Bike: Easy Spin (Flat)
Name: Title, Length: 897, dtype: object

This returns a list of all of the workout titles &, if we want to access a particular row within that list, we can do it in the same way that we would access any item within a regular list - by passing the index. E.g. the 50th row on the list would be (remember, lists begin at index 0)...

In [127]:
df['Title'][49]
Out[127]:
'Long Swim: Pace Change 400s (Pool)'

Cool! So, we have 2 massive python dictionaries - one with the athlete's wellness metrics, the other with the athlete's workouts. What can we do with them?

Well, we can do a lot with them. But, for starters, one of the most important things for a coach to understand about an athlete is the relationship between the training load and the level of fatigue experienced by the athlete, i.e. how much is "too much" for this athlete? At what point does the training start to encroach negatively on the athlete's heart rate variability, soreness, mood and overall wellness?

You can see that, here, with the ability to export both training and wellness data from Training Peaks, we have the ability to answer this question. However, as it stands, these two components of the data live in 2 separate spots - 2 separate spreadsheets - i.e. the datasets currently can't 'talk to each other' &, frankly, in their current format, it's hard to make a lot of sense of them. Sounds like a perfect case for some good ol' data wrangling! :-)

In the words of wikipedia..

Data wrangling is the process of gathering, selecting, and transforming data to answer an analytical question

Which is exactly what we want to do here - we want to investigate the relationship between training load and athlete wellness but in order to do that, we will need to gather, select and transform the data...

To begin with, if we take a look back at the way the metrics csv is formatted, you'll see it's kind of weird. We have 3 columns - a timestamp, a data field (HRV, soreness, mood etc) & the value for that data field. It would be much easier to understand and to work with if the data instead had each row as one day and then columns for HRV, soreness, mood, stress, fatigue etc across the top. So let's start by using pandas to put it in that format...

In [128]:
metrics_df = pd.read_csv('metrics.csv')
print(metrics_df)
                Timestamp             Type Value
0     2020-01-01 09:12:00   Sleep Qualilty     5
1     2020-01-01 09:12:00      Sleep Hours  8.00
2     2020-01-01 09:12:00           Stress     4
3     2020-01-01 09:12:00  Overall Feeling     7
4     2020-01-01 09:12:00             Mood     2
...                   ...              ...   ...
3285  2020-12-30 08:15:00             Mood     2
3286  2020-12-30 08:15:00           Injury     9
3287  2020-12-30 08:15:00         Sickness     6
3288  2020-12-30 08:15:00              HRV    10
3289  2020-12-30 08:15:00            Pulse    42

[3290 rows x 3 columns]

So, the above is the current format of our metrics dataframe. What we want is to group all of the different types together in their own column. There are a few ways that we can do this with pandas, but let's go for the most explicit: We'll pull each of the metrics that we want into its own column and then merge them together...

In [129]:
HRV = metrics_df[metrics_df.Type=='HRV']
HRV = HRV.set_index('Timestamp')
HRV = HRV.drop('Type', axis=1)
HRV = HRV.rename(columns={'Value':'HRV'})
print(HRV)
                      HRV
Timestamp                
2020-01-01 09:12:00   8.7
2020-01-02 08:18:00  10.1
2020-01-03 08:33:00    10
2020-01-04 07:38:00  10.5
2020-01-05 08:47:00   9.6
...                   ...
2020-12-26 09:07:00   9.8
2020-12-27 08:54:00   9.1
2020-12-28 07:53:00   9.7
2020-12-29 06:37:00   9.4
2020-12-30 08:15:00    10

[365 rows x 1 columns]

Above, we're introducing a new, very powerful pandas feature called "masking". Masking is a way of searching your dataframe to return only the rows that match a given condition. This condition could be numerical, e.g. "give me all the days that HRV was below 8", or it could be text based, e.g. "give me all the rows corresponding to a given athlete's name" etc. In this case, we want to separate each of the different metric types, beginning by selecting only the rows that have "HRV" as their Type. We then set the index to Timestamp so that we have a common index to build our big metrics dataframe on and, finally, we drop the Type column (since we know all the values are HRV) & rename the Value column to 'HRV'. Since we want to do this multiple times (for HRV, sleep, mood etc), let's build a simple function to do the same for each of the other key metrics and join them together into our new, improved metrics dataframe...
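(Before we do, here's a quick sketch of that numerical case - "all the days that HRV was below 8" - run against the HRV column we just built. The astype(float) is needed because, as we'll see a little later, the exported values come in as text.)

# numerical mask (sketch): return only the mornings where HRV dropped below 8
low_hrv_days = HRV[HRV['HRV'].astype(float) < 8]
print(low_hrv_days)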

In [130]:
def build_series(metric):
  new_df = metrics_df[metrics_df.Type==str(metric)]
  new_df = new_df.set_index('Timestamp')
  new_df = new_df.drop('Type', axis=1)
  new_df = new_df.rename(columns={'Value':str(metric)})
  return new_df 

HRV = build_series('HRV')
pulse = build_series('Pulse')
stress = build_series('Stress')
sleep_quality = build_series('Sleep Qualilty')
sleep_hours = build_series('Sleep Hours')
mood = build_series('Mood')
new_improved_metrics = pd.concat([HRV, pulse, stress, sleep_quality, sleep_hours, mood], axis=1)
print(new_improved_metrics)
                      HRV Pulse Stress Sleep Qualilty Sleep Hours Mood
2020-01-01 09:12:00   8.7    47      4              5        8.00    2
2020-01-02 08:18:00  10.1    41      3              5        9.25    2
2020-01-03 08:33:00    10    37      3              5        8.75    2
2020-01-04 07:38:00  10.5    38      3              5        8.50    2
2020-01-05 08:47:00   9.6    45      3              5        8.83    2
...                   ...   ...    ...            ...         ...  ...
2020-12-26 09:07:00   9.8    42      3              5        9.50    2
2020-12-27 08:54:00   9.1    37      3              5        9.67    2
2020-12-28 07:53:00   9.7    44      3              5        8.67    2
2020-12-29 06:37:00   9.4    37      4              5        9.33    2
2020-12-30 08:15:00    10    42      4              5        8.50    2

[365 rows x 6 columns]

Ahh, much better! We built our little function above to separate all of the values into their own columns and then we "concatenated", i.e. joined, each of these columns together into one big dataframe that is formatted the way we want, i.e. with each day as a row and the various wellness features each having their own column for that day. Beautiful!
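As an aside, pandas also has a built-in pivot function that does this whole reshape in one go (a sketch - it would turn every Type into a column, assuming there's only one value per Timestamp/Type pair):

# one-line reshape (sketch): one row per Timestamp, one column per metric Type
wide_metrics = metrics_df.pivot(index='Timestamp', columns='Type', values='Value')

The explicit version above is more typing, but it makes each step - filter, re-index, rename - easy to inspect along the way.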

Next, let's bring our training data into the mix. Let's start by adding up the total TSS for a given date.

In [131]:
workouts_df = pd.read_csv('workouts.csv')
TSS_day = workouts_df.groupby(['WorkoutDay'])['TSS'].sum()
print(TSS_day)
WorkoutDay
2020-01-01     75.00
2020-01-02     42.20
2020-01-03     73.98
2020-01-04    197.00
2020-01-05     74.73
               ...  
2020-12-27     73.82
2020-12-28      0.00
2020-12-29     71.20
2020-12-30     94.44
2020-12-31    123.76
Name: TSS, Length: 365, dtype: float64

Here we return to our workouts dataframe (the exported csv of workouts) from above and we use pandas' groupby function to tell it to group all of the workouts with the same WorkoutDay together and then add up the total TSS for that day. This gives us a nice series with the workout date as the index and the total TSS for that day as the values.

Now, let's add that daily training load to our new_improved_metrics dataframe.

In order to bring these 2 dataframes together, we want a common index. The first thing you'll notice if you look at the metrics and workouts dataframes is that the index (the first column) is a little different. For our training load we have just the date in Y-m-d format. For our metrics dataframe, we have the whole timestamp. We want to make these the same if we're going to be able to join the two dataframes together. Let's do that...

In [132]:
new_improved_metrics['Date'] = new_improved_metrics.index
new_improved_metrics['Date'] = new_improved_metrics['Date'].astype(str)  # make sure it's a string so we can slice it
new_improved_metrics['Date'] = new_improved_metrics['Date'].str.slice(0,10)
print(new_improved_metrics['Date'])
new_improved_metrics = new_improved_metrics.set_index('Date')
2020-01-01 09:12:00    2020-01-01
2020-01-02 08:18:00    2020-01-02
2020-01-03 08:33:00    2020-01-03
2020-01-04 07:38:00    2020-01-04
2020-01-05 08:47:00    2020-01-05
                          ...    
2020-12-26 09:07:00    2020-12-26
2020-12-27 08:54:00    2020-12-27
2020-12-28 07:53:00    2020-12-28
2020-12-29 06:37:00    2020-12-29
2020-12-30 08:15:00    2020-12-30
Name: Date, Length: 365, dtype: object

Easy enough... we used pandas to pull the index into its own column called 'Date', made sure it's a string, sliced it down to the first 10 characters (yyyy-mm-dd) and then set that chopped-down date as the index for our dataframe so that it matches the index of our workouts dataframe. Now let's use our old favorite concat function to join these 2 dataframes of training load and wellness metrics into one...
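As an aside, pandas' own datetime tools could do the same normalization. Here's a sketch of an equivalent approach (the string slice works fine here because the exported timestamps are consistently formatted, but to_datetime is the safer route if formats vary):

# alternative sketch: parse the timestamps into real datetimes, then format back to yyyy-mm-dd strings
new_improved_metrics.index = pd.to_datetime(new_improved_metrics.index).strftime('%Y-%m-%d')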

In [133]:
new_improved_metrics = pd.concat([TSS_day, new_improved_metrics], axis=1)
print(new_improved_metrics)
               TSS   HRV Pulse Stress Sleep Qualilty Sleep Hours Mood
2020-01-01   75.00   8.7    47      4              5        8.00    2
2020-01-02   42.20  10.1    41      3              5        9.25    2
2020-01-03   73.98    10    37      3              5        8.75    2
2020-01-04  197.00  10.5    38      3              5        8.50    2
2020-01-05   74.73   9.6    45      3              5        8.83    2
...            ...   ...   ...    ...            ...         ...  ...
2020-12-28    0.00   9.7    44      3              5        8.67    2
2020-12-29   71.20   9.4    37      4              5        9.33    2
2020-12-30   94.44    10    42      4              5        8.50    2
2020-12-31  123.76   NaN   NaN    NaN            NaN         NaN  NaN
2020-08-31     NaN   9.2    40      3              5       11.00    2

[366 rows x 7 columns]

Now that we have our new and improved dataframe that contains both athlete wellness and training load data, we can have a bit of fun analyzing the data...

In [136]:
new_improved_metrics.info()
<class 'pandas.core.frame.DataFrame'>
Index: 366 entries, 2020-01-01 to 2020-08-31
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   TSS             365 non-null    float64
 1   HRV             365 non-null    object 
 2   Pulse           365 non-null    object 
 3   Stress          365 non-null    object 
 4   Sleep Qualilty  362 non-null    object 
 5   Sleep Hours     363 non-null    object 
 6   Mood            362 non-null    object 
dtypes: float64(1), object(6)
memory usage: 22.9+ KB

To start with, the info() method returns a count of how many non-null values we have for each column. This athlete has completed 365 days' worth of values for TSS, HRV, Pulse and Stress, with only a couple missing for Sleep Quality, Sleep Hours and Mood. A commendable job of data collection! :-)
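If you want to see exactly which days those missing entries fall on, a quick null-mask will show you (a sketch using the Mood column):

# sketch: show the rows where Mood wasn't recorded
print(new_improved_metrics[new_improved_metrics['Mood'].isnull()])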

One little problem that we see immediately when we look at the data info is that all of the data apart from TSS has Dtype "object". This is the standard data type for when we are dealing with text but, if we want to be able to perform calculations on the data, we need to change it to a 'float' (a number in decimal format) so let's start by doing that...

In [137]:
metrics = ['HRV', 'Pulse', 'Stress', 'Sleep Qualilty', 'Sleep Hours', 'Mood']
for metric in metrics:
  new_improved_metrics[metric] = new_improved_metrics[metric].astype(float)

new_improved_metrics.info()
<class 'pandas.core.frame.DataFrame'>
Index: 366 entries, 2020-01-01 to 2020-08-31
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   TSS             365 non-null    float64
 1   HRV             365 non-null    float64
 2   Pulse           365 non-null    float64
 3   Stress          365 non-null    float64
 4   Sleep Qualilty  362 non-null    float64
 5   Sleep Hours     363 non-null    float64
 6   Mood            362 non-null    float64
dtypes: float64(7)
memory usage: 22.9+ KB

Perfect! Now all of our data is a number. We need numbers to compute statistics, so that was a good first step. Let's compute some basic descriptive statistics for our data now..

In [138]:
new_improved_metrics.describe()
Out[138]:
              TSS         HRV       Pulse      Stress  Sleep Qualilty  Sleep Hours        Mood
count  365.000000  365.000000  365.000000  365.000000      362.000000   363.000000  362.000000
mean   152.167178    9.781644   37.460274    3.104110        4.961326     8.148623    1.988950
std     72.503683    0.404463    2.820974    0.331678        0.330611     0.887075    0.148453
min      0.000000    7.500000   32.000000    3.000000        2.000000     4.000000    1.000000
25%     98.620000    9.500000   36.000000    3.000000        5.000000     7.670000    2.000000
50%    144.070000    9.800000   37.000000    3.000000        5.000000     8.000000    2.000000
75%    197.000000   10.100000   39.000000    3.000000        5.000000     8.670000    2.000000
max    386.910000   10.800000   59.000000    5.000000        6.000000    12.000000    3.000000

This, in itself, is a really handy feature of pandas. Not only does it provide a really easy way to identify funky data - e.g. if we saw that our max pulse was 200, we would know that we need to go in and clean those values before feeding the data to any models - but we also get a lot of descriptive information about this athlete. We can see that he averaged 152 TSS/day of training load, with a standard deviation of ~72 TSS/d. We can also see that the athlete has a mean HRV score of ~9.8 and a standard deviation of only 0.4, i.e. all scores are narrowly dispersed. The stats go a step further in explaining this dispersion by setting quartiles, i.e. the cut-off for the bottom 25% of scores is an HRV of 9.5, and the top 25% of scores are above an HRV of 10.1. This kind of context is very handy when assessing just how low a score needs to be before raising the red flag. If you run this same describe function on many other athletes, you'll see a much wider range of HRV scores. This is the reason that sticking to arbitrary cut-offs when changing the plan in accordance with HRV is a poor strategy.
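As a sketch of what that clean-up might look like in practice (the 200 bpm threshold here is purely illustrative):

# sketch: blank out any resting pulse that's clearly a data-entry error before it reaches a model
suspect = new_improved_metrics['Pulse'] > 200
new_improved_metrics.loc[suspect, 'Pulse'] = float('nan')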

Another neat way to get a quick overview of the data is to plot it.

In [139]:
import matplotlib.pyplot as plt
new_improved_metrics.hist(figsize=(20,15))
plt.show()

Here we have a frequency histogram of the data showing how many occurrences for the year fell in each 'bin'. Again, very helpful in providing context. Clearly, this is a very stable athlete, meaning any excursions from the norm - a rating of poor sleep, or a bad mood etc. - are worth taking note of!
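One way to flag those excursions programmatically - using this athlete's own distribution rather than an arbitrary cut-off - might look like this (a sketch):

# sketch: flag mornings where HRV sits more than 2 standard deviations below this athlete's mean
hrv_mean = new_improved_metrics['HRV'].mean()
hrv_std = new_improved_metrics['HRV'].std()
red_flags = new_improved_metrics[new_improved_metrics['HRV'] < hrv_mean - 2 * hrv_std]
print(red_flags)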

On the training side, we can see that the vast majority of days were between 80 and 150 TSS and an impressive 50 days - i.e. roughly one day each week - were between 225 and 275 TSS! It seems with consistency of training load comes a nice, stable physiology!

Another really useful step in working out which features are important is to see which features are correlated with each other. This couldn't be easier with pandas...

In [141]:
corr_matrix = new_improved_metrics.corr()
corr_matrix['HRV'].sort_values(ascending=False)
Out[141]:
HRV               1.000000
Sleep Qualilty    0.318967
Mood              0.129927
TSS               0.101500
Pulse            -0.025802
Sleep Hours      -0.078097
Stress           -0.163880
Name: HRV, dtype: float64

We can see that for this athlete, his sleep quality is most strongly correlated with a high HRV. Mood is also positively correlated & Stress is negatively correlated, i.e. as self reported stress goes up, HRV goes down.

It may come as a surprise that TSS is positively correlated with HRV but remember, this is the TSS of training done for that day after the morning metrics are collected. Let's see if the TSS performed on the day before creates more of a negative shift

In [143]:
new_improved_metrics['yesterday_TSS'] = new_improved_metrics['TSS'].shift(1)
print(new_improved_metrics)
               TSS   HRV  Pulse  ...  Sleep Hours  Mood  yesterday_TSS
2020-01-01   75.00   8.7   47.0  ...         8.00   2.0            NaN
2020-01-02   42.20  10.1   41.0  ...         9.25   2.0          75.00
2020-01-03   73.98  10.0   37.0  ...         8.75   2.0          42.20
2020-01-04  197.00  10.5   38.0  ...         8.50   2.0          73.98
2020-01-05   74.73   9.6   45.0  ...         8.83   2.0         197.00
...            ...   ...    ...  ...          ...   ...            ...
2020-12-28    0.00   9.7   44.0  ...         8.67   2.0          73.82
2020-12-29   71.20   9.4   37.0  ...         9.33   2.0           0.00
2020-12-30   94.44  10.0   42.0  ...         8.50   2.0          71.20
2020-12-31  123.76   NaN    NaN  ...          NaN   NaN          94.44
2020-08-31     NaN   9.2   40.0  ...        11.00   2.0         123.76

[366 rows x 8 columns]

Here we simply apply pandas' shift function to create a new column called "yesterday_TSS" that moves all training load forward by one day. Let's see if yesterday's TSS is better correlated...

In [144]:
corr_matrix = new_improved_metrics.corr()
corr_matrix['HRV'].sort_values(ascending=False)
Out[144]:
HRV               1.000000
Sleep Qualilty    0.318967
Mood              0.129927
TSS               0.101500
Pulse            -0.025802
yesterday_TSS    -0.039875
Sleep Hours      -0.078097
Stress           -0.163880
Name: HRV, dtype: float64

As expected, it's now slightly negatively correlated, i.e. higher training load on the day before leads to slightly lower HRV.

A couple of other useful features to add might be the long term dynamics of chronic and acute training load, i.e. CTL and ATL. Again, this is super easy with pandas..

In [157]:
import math

def calc_ctl(TSS, start_ctl, exponent):
  # exponentially weighted rolling load:
  # today's value = today's TSS * (1 - e^(-1/exponent)) + yesterday's value * e^(-1/exponent)
  ctl = [start_ctl]
  for i in range(len(TSS)):
    ctl_value = TSS[i] * (1-math.exp(-1/exponent)) + ctl[-1] * math.exp(-1/exponent)
    ctl.append(ctl_value)
  return ctl

ctl = calc_ctl(new_improved_metrics['TSS'], 103, 42)
atl = calc_ctl(new_improved_metrics['TSS'], 50, 7)

tsb = []
for i in range(len(ctl)):
  tsb.append(ctl[i]-atl[i])

print(ctl)
print(atl)
print(tsb)
[103, 102.34120722626281, 100.92618605753093, 100.29218774844239, 102.56755945827307, 101.91258863650201, 101.09091902814758, 102.25201966867715, 103.29945265952179, 105.33278237703321, 105.48988604555754, 106.1260943196367, 109.37968138343594, 109.12652422832443, 110.02021134831733, 111.29708794103065, 113.24577136832778, 116.1574996376837, 117.71348582867891, 119.50084977803856, 120.98029012750153, 120.66476861929735, 121.8810702218073, 122.46266494724982, 125.85538503028239, 126.4794644827185, 126.58935430587093, 125.6225911025054, 126.15638756809433, 127.846746593427, 127.51907373566205, 128.48187411128984, 131.49557167332634, 133.80709736766752, 134.1949122656816, 133.50353484219832, 135.56029682618433, 135.14901492504606, 135.04104314393206, 136.11485082170637, 138.7562604296059, 137.4926427054819, 138.06855363783234, 138.67561815266677, 140.641041264082, 140.7478353074932, 144.59358905574763, 146.9804520004493, 146.06683695834772, 145.3196521474581, 146.7179479940989, 146.18696224207787, 146.32843887861816, 149.91701396097105, 150.68081326870117, 149.0215774765765, 149.05377415049975, 146.40557368381414, 145.41772395009602, 144.92485933707587, 145.78235202659624, 149.3357778522892, 148.00346880742006, 149.31626707703697, 149.26294563511874, 151.15408215618532, 149.15219721030832, 151.90872270547888, 152.72847919510045, 153.91387140353464, 156.5397753686823, 155.3141908115937, 155.02139999081928, 155.11830370238937...]
[50, 53.32805250624546, 51.84666278492381, 54.79309912053221, 73.72398043562471, 73.85790387292676, 72.94363408504888, 83.26011307737562, 91.71468261114384, 104.76135629759843, 105.72631097122475, 109.29447434917833, 127.28133690017393, 123.46587953605516, 126.61343451077519, 131.62901803890063, 139.9479488042357, 152.86772169296873, 156.78447717533538, 161.69608666356984, 164.44957264830066, 156.87764447604675, 158.93868816020978, 157.29613607799507, 171.8548833347277, 169.26234596308856, 164.18875042303455, 153.7135387006864, 152.99420828006984, 158.98547379723283, 152.986263276592, 155.04349716591983, 168.55891845421172, 176.70347252514247, 173.18725558770464, 164.08473171455077, 171.65076199697168, 164.51930964795167, 159.9985750245824, 162.75172975346905, 174.15074492944626, 162.289466630615, 162.24693856178783, 162.46300390927712, 170.41664951249186, 167.0570937261204, 185.31384544366327, 193.3978386782979, 182.04946735168517, 173.03184718802484, 177.25423430639967, 170.18488710337328, 167.79070115941636, 185.23759027834345, 184.85718185883692, 170.91967311955136, 168.18671991643444, 150.6563071861491, 144.50123286573316, 141.8346330760536, 147.09766415511655, 167.0276847360936, 157.13435845832095, 163.34658666985445, 161.17715082546025, 170.29106909666152, 156.41694148823248, 171.04613624019515, 173.1366675659374, 177.12678053104582, 188.89387155459616, 177.65252602002244, 173.0222031042431, 171.1741750709117, 167.56601035993316, 181.06911609753718, 172.1447281637416, 166.68875507241643, 168.0893652944191, 173.7125273023114, 163.39655931416132, 181.52425373650422, 179.11151501364017, 171.57660245770276...]
[53, 49.01315472001735, 49.07952327260712, 45.499088627910176, 28.843579022648356, 28.054684763575253, 28.147284943098697, 18.99190659130153, 11.584770048377948, 0.5714260794347865, -0.23642492566720819, -3.1683800295416233, -17.90165551673799, -14.339355307730727, -16.59322316245786, -20.33193009786997, -26.702177435907913, -36.71022205528503, -39.07099134665647, -42.19523688553129, -43.469282520799126, -36.212875856749406, -37.05761793840247, -34.83347113074525, -45.99949830444531, -42.782881480370065, -37.59939611716362, -28.090947598181003, -26.83782071197551, -31.13872720380583, -25.467189540929965, -26.56162305462999, -37.06334678088538, -42.89637515747495, -38.99234332202303, -30.58119687235245, -36.090465170787354, -29.370294722905612, -24.95753188065035, -26.63687893176268, -35.39448449984036, -24.79682392513311, -24.17838492395549, -23.787385756610348...]

Excellent! We built a quick function above to calculate the exponentially weighted chronic and acute training load from the TSS values that we pulled from TP. We supply the starting value for each, along with the exponent that we wish to use (42 & 7 are the default time constants in Training Peaks). We then end up with lists of ctl, atl and tsb numbers that we can add to our dataframe....
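As an aside, pandas' built-in ewm method will get you essentially the same exponentially weighted average in one line - the main difference being that it seeds the average from the first TSS value rather than a starting CTL/ATL of your choosing (a sketch):

# sketch: exponentially weighted load straight from pandas (seeded from the first TSS value)
ctl_ewm = new_improved_metrics['TSS'].ewm(alpha=1-math.exp(-1/42), adjust=False).mean()
atl_ewm = new_improved_metrics['TSS'].ewm(alpha=1-math.exp(-1/7), adjust=False).mean()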

In [158]:
new_improved_metrics['ctl'] = ctl[1:]
new_improved_metrics['atl'] = atl[1:]
new_improved_metrics['tsb'] = tsb[1:]
print(new_improved_metrics)
               TSS   HRV  Pulse  ...         ctl        atl        tsb
2020-01-01   75.00   8.7   47.0  ...  102.341207  53.328053  49.013155
2020-01-02   42.20  10.1   41.0  ...  100.926186  51.846663  49.079523
2020-01-03   73.98  10.0   37.0  ...  100.292188  54.793099  45.499089
2020-01-04  197.00  10.5   38.0  ...  102.567559  73.723980  28.843579
2020-01-05   74.73   9.6   45.0  ...  101.912589  73.857904  28.054685
...            ...   ...    ...  ...         ...        ...        ...
2020-12-28    0.00   9.7   44.0  ...   94.763281  55.313371  39.449910
2020-12-29   71.20   9.4   37.0  ...   94.208877  57.428233  36.780644
2020-12-30   94.44  10.0   42.0  ...   94.214315  62.355317  31.858998
2020-12-31  123.76   NaN    NaN  ...   94.909475  70.529637  24.379838
2020-08-31     NaN   9.2   40.0  ...         NaN        NaN        NaN

[366 rows x 11 columns]
In [159]:
new_improved_metrics.describe()
Out[159]:
              TSS         HRV       Pulse      Stress  Sleep Qualilty  Sleep Hours        Mood  yesterday_TSS         ctl         atl        tsb
count  365.000000  365.000000  365.000000  365.000000      362.000000   363.000000  362.000000     365.000000  365.000000  365.000000  365.000000
mean   152.167178    9.781644   37.460274    3.104110        4.961326     8.148623    1.988950     152.167178  153.087104  151.800912    1.286191
std     72.503683    0.404463    2.820974    0.331678        0.330611     0.887075    0.148453      72.503683   27.420354   39.167613   21.714522
min      0.000000    7.500000   32.000000    3.000000        2.000000     4.000000    1.000000       0.000000   94.208877   51.846663  -46.417387
25%     98.620000    9.500000   36.000000    3.000000        5.000000     7.670000    2.000000      98.620000  131.755459  120.575143  -16.539083
50%    144.070000    9.800000   37.000000    3.000000        5.000000     8.000000    2.000000     144.070000  156.520107  163.396559    2.238299
75%    197.000000   10.100000   39.000000    3.000000        5.000000     8.670000    2.000000     197.000000  181.038193  180.797980   18.195925
max    386.910000   10.800000   59.000000    5.000000        6.000000    12.000000    3.000000     386.910000  191.652354  226.092983   49.079523

You can see that we now have the additional context added by the long term dynamics of chronic load, acute load and training stress balance to compare against the athlete's wellness scores - all in the one place!

In this case, the athlete peaked out at a CTL of ~192, against a low of ~94 for the year. A solid year all things considered!

We can also run our correlation analysis again to see how the longer term load dynamics relate to HRV (or anything else for that matter)

In [155]:
corr_matrix = new_improved_metrics.corr()
corr_matrix['HRV'].sort_values(ascending=False)
Out[155]:
HRV               1.000000
Sleep Qualilty    0.318967
Mood              0.129927
TSS               0.101500
tsb              -0.012778
Pulse            -0.025802
atl              -0.032126
yesterday_TSS    -0.039875
ctl              -0.048263
Sleep Hours      -0.078097
Stress           -0.163880
Name: HRV, dtype: float64

Adding CTL, ATL and TSB to the correlation matrix, we see a slight negative correlation with HRV for each, i.e. as absolute chronic load goes up, the athlete's HRV tends to go slightly down. Though it's worth noting that, for this athlete, the strength of this relationship is nowhere near that of self-reported stress or sleep quality!

Hopefully, by now you're seeing the power of pandas in pulling all of your data together from various places so that you can clearly see the relationships. Being able to pull the data together & 'see' the relationships is only the first step in developing powerful individual models that use machine learning to predict fitness and fatigue for an individual athlete - and pandas is perfectly suited to talk to the machine learning libraries that I will introduce in future posts.

But, for now, we started with a couple of messy csvs, so let's end by saving our dataframe to a much more useful csv that we can use to feed our machine learning models in future posts...

In [156]:
new_improved_metrics.to_csv('new_improved_metrics.csv')

tp_reformat_csv.png

Much better! Long and short term training load & wellness data all wrangled together, ready for the fun stuff of building some models. We'll tackle that next time around. Until then...

Code smart,

AC