Evan Savage

Self-Tracking For Panic: A Deeper Look

In this post, I apply three statistical and machine learning tools to my panic recovery journal data: linear regression/correlation, the Fast Fourier Transform, and maximum entropy modelling.

First, A Word About Tools #

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

Now, A Necessary Disclaimer #

My experiment has fewer than 50 samples, which is nowhere near enough to draw statistically significant conclusions. That's not the point. The primary purpose of this post is to demonstrate analysis techniques by example. These same methods can be wielded on larger datasets, where they are much more useful.

Getting Ready #

To follow along with the examples here, you'll need the excellent Python toolkits scipy, matplotlib, and nltk:

$ pip install scipy nltk matplotlib

Linear Regression #

What? #

Linear regression answers this question:

What is the line that most closely fits this data?

Given points $ P_i = (x_i, y_i) $, the goal is to find the line $ y = mx + b $ such that some error function is minimized. A common one is the least squares function:

f(m, b) = \sum_{i} \left(y_i - (mx_i + b)\right)^2

The Pearson correlation coefficient $ R $ and p-value $ p $ are also useful here, as they measure correlation and statistical significance.

Why? #

In a self-tracking context, you might ask the following questions:

Linear regression can help address both questions. However, it can only find linear relationships between datasets. Many dynamic processes are locally linear but not globally linear. For instance, there are practical limits to how much you can exercise in a day, so no linear model with non-zero slope will accurately capture your exercise duration for all time.

The Data #

You can see the code for this analysis here. I look at only the first 31 days, that being the largest consecutive run for which I have data.

Alcohol Consumption

My alcohol consumption did not decrease over time, but rather stayed fairly constant: with $ R = 0.0098 $, there is no correlation between alcohol and time.

Sugar Consumption

Sugar consumption is a similar story: although the best-fit slope is slightly negative, $ R = -0.0671 $ indicates no correlation over time. It seems that my alcohol and sugar consumption were not modified significantly over the tracking period.

Alcohol and Sugar Consumption

I decided to graph alcohol and sugar together. It looks like they might be related, as the peaks in each seem to coincide on several occasions. Let's test this hypothesis:

Alcohol vs. Sugar Consumption

The positive slope is more pronounced this time, but $ R = 0.1624 $ still indicates a small degree of correlation. We can also look at the p-value: with $ p = 0.3827 $, it is fairly easy to write this off as a random effect.

Finally, let's take another look at a question from a previous blog post:

On days where I drink heavily, do I drink less the day after?
Alcohol Consumption: Today vs. Yesterday

There's a negative slope there, but the correlation and p-value statistics are in the same uncertain zone as before. I likely need more data to investigate these last two effects properly.

Fast Fourier Transform #

What? #

Fourier analysis answers this question:

What frequencies comprise this signal?

Given a sequence $ x_n $, a Discrete Fourier Transform (DFT) computes

X_k = \sum_{n=0}^{N-1} x_n \cdot e^{\frac{-2 i \pi k n}{N}}

The $ X_k $ encode the amplitude and phase of frequencies $ \frac{f k}{N} $ Hz, where $ T $ is the time between samples and $ f = 1 / T $ is the sampling frequency.

As described here, the DFT requires $ \mathcal{O}(N^2) $ time to compute. The Fast Fourier Transform (FFT) uses divide-and-conquer on this sum of complex exponentials to compute the DFT in $ \mathcal{O}(N \log N) $ time. Further speedups are possible for real-world signals that are sparse in the frequency domain.

Why? #

In a self-tracking context, you might ask the following questions:

With the FFT, Fourier analysis can help address these questions. However, it can only find periodic effects. Unlike linear regression, it does not help find trends in your data.

The Data #

You can see the code for this analysis here. Again, I look at the first 31 days to ensure that the frequency analysis is meaningful.

Frequency Strengths

There are some apparent maxima there, but it's hard to tell what they mean. Part of the difficulty is that these are frequencies rather than period lengths, so let's deal with that:

$ python food_fft.py
food_fft.py:32: RuntimeWarning: divide by zero encountered in divide
for strength, phase, period in sorted(zip(FS, FP, 1.0 / Q))[-5:]:
[2.21 days] 3.0461 (phase=-0.67 days)
[-2.21 days] 3.0461 (phase=-0.67 days)
[7.75 days] 3.1116 (phase=-3.67 days)
[-7.75 days] 3.1116 (phase=-3.67 days)
food_fft.py:33: RuntimeWarning: invalid value encountered in double_scalars
phase_days = period * (phase / (2.0 * math.pi))
[inf days] 18.1401 (phase=nan days)

If you're not familiar with the Fourier transform, the last line might be a bit mysterious. That corresponds to $ X_0 $, which is just the sum of the original samples:

X_0 = \sum_{n=0}^{N-1} x_n \cdot e^0 = \sum_{n=0}^{N-1} x_n

Other than that, the most pronounced cycles have period lengths of 2.21 days and 7.75 days. The former might be explained by a see-saw drinking pattern, whereas the latter is likely related to the day-of-week effects we saw in the previous post.

Which day of the week? The phase is -3.67 days, and our sample starts on a Monday, placing the first peak on Thursday. The period is slightly longer than a week, though, and the data runs for 31 days, so these peaks gradually shift to cover the weekend.

There are two caveats:

  1. I have no idea whether a Fourier coefficient of about 3 is significant here. If it isn't, I'm grasping at straws.
  2. Again, the small amount of data means the frequency domain data is sparse. To accurately test for bi-daily or weekly effects, I need more fine-grained period lengths.

Maximum Entropy Modelling #

What? #

Maximum entropy modelling answers this question:

Given observations of a random process, what is the most likely model for that random process?

Given a discrete probability distribution $ p(X = x_k) = p_k $, the entropy of this distribution is given by

H(p) = \sum - p_k \log p_k

(Yes, I'm conflating the concepts of random variables and probability distributions. If you knew that, you probably don't need this explanation.)

This can be thought of as the number of bits needed to encode outcomes in this distribution. For instance, if I have a double-headed coin, I need no bits: I already know the outcome. Given a fair coin, though, I need one bit: heads or tails?

After repeated sampling, we get observed expected values for $ p_k $; let these be $ p'_k $. Since we would like the model to accurately reflect what we already know, we impose the constraints $ p_k = p'_k $. The maximum entropy model is the model that also maximizes $ H(p') $.

This model encodes what is known while remaining maximally noncommittal on what is unknown.

Adam Berger (CMU) provides a more concrete example. If you're interested in learning more, his tutorial is highly recommended reading.

Why? #

In a self-tracking context, you might ask the following questions:

Maximum entropy modelling can help address these questions. It is often used to classify unseen examples, and would be fantastic in a data commons scenario with enough data to provide recommendations to users.

Feature Extraction #

Since I'm now effectively building a classifier, there's an additional step. I need features for my classifier, which I extract from my existing datasets:

train_set = []
dates = set(W).intersection(F)
for ds in dates:
ds_data = {
'relaxation' : bool(int(W[ds]['relaxation'])),
'exercise' : bool(int(W[ds]['exercise'])),
'caffeine' : int(F[ds]['caffeine']) > 0,
'sweets' : int(F[ds]['sweets']) > 1,
'alcohol' : int(F[ds]['alcohol']) > 4,
'supplements' : bool(int(F[ds]['supplements']))
except (ValueError, KeyError):
had_panic = P.get(ds) and 'panic' or 'no-panic'
train_set.append((ds_data, had_panic))

Note that the features listed here are binary. I use my daily goals as thresholds on caffeine, sweets, and alcohol.

(If you know how to get float-valued features working with NLTK, let me know! Otherwise, there's always megam or YASMET.

The Data #

You can see the code for this analysis here. This time I don't care about having consecutive dates, so I use all of the samples.

After building a MaxentClassifier, I print out the most informative features with show_most_informative_features():

  -2.204 exercise==True and label is 'panic'
1.821 caffeine==True and label is 'panic'
-0.867 relaxation==True and label is 'panic'
0.741 alcohol==True and label is 'panic'
-0.615 caffeine==True and label is 'no-panic'
-0.537 supplements==True and label is 'panic'
0.439 sweets==True and label is 'panic'
0.430 exercise==True and label is 'no-panic'
0.284 relaxation==True and label is 'no-panic'
0.233 supplements==True and label is 'no-panic'

Exercise, relaxation breathing, and vitamin supplements help with panic. Caffeine, alcohol, and sweets do not. I knew that already, but this suggests which treatments or dietary factors have greatest impact.

Let's consider the supplements finding more closely. Of the 45 days, I took supplements on all but two. It's dangerous to draw any conclusions from a feature for which there are very few negative samples. This points out some important points about data analysis:

Up Next #

In my next post, I look at a panic recovery dataset gathered using qs-counters, a simple utility I built to reduce friction in self-tracking. I perform these same three analyses on the qs-counters dataset, then compare it to the recovery-journal dataset.