Evan Savage

Self-Tracking For Panic: A Deeper Look

In this post, I apply three statistical and machine learning tools to my panic
recovery journal data: linear regression/correlation, the Fast Fourier
Transform, and maximum entropy modelling.

First, A Word About Tools #

I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

Now, A Necessary Disclaimer #

My experiment has fewer than 50 samples, which is nowhere near enough to draw
statistically significant conclusions. That's not the point. The primary
purpose of this post is to demonstrate analysis techniques by example. These
same methods can be wielded on larger datasets, where they are much more
useful.

Getting Ready #

To follow along with the examples here, you'll need
the excellent Python toolkits
scipy,
matplotlib, and
nltk:

$ pip install scipy nltk matplotlib

Linear Regression #

What? #

Linear regression answers this question:

What is the line that most closely fits this data?

Given points $ P_i = (x_i, y_i) $, the goal is to find the line
$ y = mx + b $ such that some error function is minimized.
A common one is the least squares function:

$$
f(m, b) = \sum_{i} \left(y_i - (mx_i + b)\right)^2
$$

The Pearson correlation coefficient $ R $ and p-value $ p $ are also useful
here: $ R $ measures the strength of the linear relationship, and $ p $ estimates
how likely a relationship at least that strong is to appear by chance alone.
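
In practice, you don't need to minimize the squared error by hand: scipy's
stats.linregress computes the best-fit slope, intercept, $ R $, and $ p $ in a
single call. Here's a quick sketch on made-up numbers:

import numpy as np
from scipy import stats

# Hypothetical data: 31 days of some tracked quantity (values invented for illustration).
days = np.arange(31)
rng = np.random.RandomState(0)
drinks = rng.randint(0, 6, size=31).astype(float)

# Fit the least-squares line drinks = m * days + b.
m, b, R, p, std_err = stats.linregress(days, drinks)
print("slope=%.4f intercept=%.4f R=%.4f p=%.4f" % (m, b, R, p))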

Why? #

In a self-tracking context, you might ask the following questions:

  1. Is one of my measurements (say, alcohol consumption) trending up or down
    over time?
  2. Are two of my measurements related to each other?

Linear regression can help address both questions. However, it can only find
linear relationships between datasets. Many dynamic processes are locally linear
but not globally linear. For instance, there are practical limits to how
much you can exercise in a day, so no linear model with non-zero slope will
accurately capture your exercise duration for all time.

The Data #

You can see the code for this analysis here. I look at only the first
31 days, that being the largest consecutive run for which I have data.

Alcohol Consumption

My alcohol consumption did not decrease over time, but rather stayed fairly
constant: with $ R = 0.0098 $, there is no correlation between alcohol and time.

Sugar Consumption

Sugar consumption is a similar story: although the best-fit slope is slightly
negative, $ R = -0.0671 $ indicates essentially no correlation over time. It
seems that my alcohol and sugar consumption did not change much over the
tracking period.

Alcohol and Sugar Consumption

I decided to graph alcohol and sugar together. It looks like they might be
related, as the peaks in each seem to coincide on several occasions. Let's
test this hypothesis:

Alcohol vs. Sugar Consumption

The positive slope is more pronounced this time, but $ R = 0.1624 $ still
indicates only a weak correlation. We can also look
at the p-value: with $ p = 0.3827 $, it is fairly easy to write this off as
a random effect.

Finally, let's take another look at a question from
a previous blog post:

On days where I drink heavily, do I drink less the day after?

Alcohol Consumption: Today vs. Yesterday

There's a negative slope there, but the correlation and p-value statistics are
in the same uncertain zone as before. I likely need more data to investigate
these last two effects properly.
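
In case you're wondering how to set up that last comparison: a lagged analysis
like this is just another linear regression, pairing each day's total with the
previous day's. A quick sketch on made-up numbers:

import numpy as np
from scipy import stats

# alcohol: hypothetical array of daily drink counts, one entry per day.
rng = np.random.RandomState(0)
alcohol = rng.randint(0, 6, size=31).astype(float)

# Regress today's consumption (y) against yesterday's (x).
m, b, R, p, std_err = stats.linregress(alcohol[:-1], alcohol[1:])
print("slope=%.4f R=%.4f p=%.4f" % (m, R, p))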

Fast Fourier Transform #

What? #

Fourier analysis answers this question:

What frequencies comprise this signal?

Given a sequence $ x_n $, a
Discrete Fourier Transform (DFT)
computes

$$
X_k = \sum_{n=0}^{N-1} x_n \cdot e^{\frac{-2 i \pi k n}{N}}
$$

The $ X_k $ encode the amplitude and phase of frequencies
$ \frac{f k}{N} $ Hz, where $ T $ is the time between samples
and $ f = 1 / T $ is the sampling frequency.

As described here, the DFT requires $ \mathcal{O}(N^2) $ time to
compute. The Fast Fourier Transform (FFT) uses
divide-and-conquer on this sum of complex exponentials to compute the DFT in
$ \mathcal{O}(N \log N) $ time.
Further speedups are possible for
real-world signals that are sparse in the frequency domain.
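
To make this concrete, here's roughly how frequency strengths, phases, and
period lengths can be computed with numpy (installed alongside scipy). The
daily totals are made up, and the variable names FS, FP, and Q simply mirror
the ones in the food_fft.py output below:

import numpy as np

# Hypothetical daily totals, one sample per day; the values are invented.
rng = np.random.RandomState(0)
daily_totals = rng.randint(0, 6, size=31).astype(float)
N = len(daily_totals)

X = np.fft.fft(daily_totals)   # complex Fourier coefficients X_k
Q = np.fft.fftfreq(N, d=1.0)   # frequencies in cycles/day (d = 1 day per sample)
FS = np.abs(X)                 # strength (amplitude) of each component
FP = np.angle(X)               # phase of each component, in radians

# Print the five strongest components, skipping the zero-frequency term X_0.
for k in (np.argsort(FS[1:]) + 1)[-5:]:
  print("[%.2f days] strength %.4f (phase %.2f rad)" % (1.0 / Q[k], FS[k], FP[k]))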

Why? #

In a self-tracking context, you might ask the following questions:

  1. Do my habits follow a weekly cycle?
  2. Are there other regular rhythms hiding in my measurements?

With the FFT, Fourier analysis can help address these questions. However, it
can only find periodic effects. Unlike linear regression, it does not help
find trends in your data.

The Data #

You can see the code for this analysis here. Again, I look at the
first 31 days to ensure that the frequency analysis is meaningful.

Frequency Strengths

There are some apparent maxima there, but it's hard to tell what they
mean. Part of the difficulty is that these are frequencies rather than
period lengths, so let's deal with that:

$ python food_fft.py
food_fft.py:32: RuntimeWarning: divide by zero encountered in divide
  for strength, phase, period in sorted(zip(FS, FP, 1.0 / Q))[-5:]:
[2.21 days] 3.0461 (phase=-0.67 days)
[-2.21 days] 3.0461 (phase=-0.67 days)
[7.75 days] 3.1116 (phase=-3.67 days)
[-7.75 days] 3.1116 (phase=-3.67 days)
food_fft.py:33: RuntimeWarning: invalid value encountered in double_scalars
  phase_days = period * (phase / (2.0 * math.pi))
[inf days] 18.1401 (phase=nan days)

If you're not familiar with the Fourier transform,
the last line might be a bit mysterious. That corresponds to $ X_0 $, which
is just the sum of the original samples:

$$
X_0 = \sum_{n=0}^{N-1} x_n \cdot e^0 = \sum_{n=0}^{N-1} x_n
$$
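
A quick numeric check with numpy, on made-up samples:

import numpy as np

samples = np.array([2.0, 0.0, 3.0, 1.0, 4.0])  # any values will do
X = np.fft.fft(samples)
print("%.1f %.1f" % (X[0].real, samples.sum()))  # both print 10.0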

Other than that, the most pronounced cycles have period lengths of
2.21 days and 7.75 days. The former might be explained by a see-saw drinking
pattern, whereas the latter is likely related to the day-of-week effects
we saw in the previous post.

Which day of the week? The phase is -3.67 days, and our sample starts on a
Monday, placing the first peak on Thursday. The period is slightly longer than
a week, though, and the data runs for 31 days, so these peaks gradually shift
to cover the weekend.

There are two caveats:

  1. I have no idea whether a Fourier coefficient of about 3 is significant
    here. If it isn't, I'm grasping at straws.
  2. Again, the small amount of data means the frequency resolution is coarse:
    with 31 samples, there are only 31 frequency bins. To accurately test for
    every-other-day or weekly effects, I need more fine-grained period lengths,
    which means collecting more data.

Maximum Entropy Modelling #

What? #

Maximum entropy modelling answers this question:

Given observations of a random process, what is the most likely model for that random process?

Given a discrete probability distribution $ p(X = x_k) = p_k $, the entropy
of this distribution is given by

$$
H(p) = -\sum_{k} p_k \log p_k
$$

(Yes, I'm conflating the concepts of
random variables and
probability distributions.
If you knew that, you probably don't need this explanation.)

With base-2 logarithms, this can be thought of as the average number of bits
needed to encode outcomes drawn from this distribution. For instance, if I have
a double-headed coin, I need no bits: I already know the outcome. Given a fair
coin, though, I need one bit: heads or tails?
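
Here's that coin example in code, a tiny sketch using base-2 logarithms:

import math

def entropy(p):
  # Equivalent to H(p) = -sum(p_k * log2(p_k)); zero-probability outcomes are skipped.
  return sum(p_k * math.log(1.0 / p_k, 2) for p_k in p if p_k > 0)

print(entropy([1.0, 0.0]))  # double-headed coin: 0.0 bits
print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit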

After repeated sampling, we get observed expected values for $ p_k $;
let these be $ p'_k $. Since we would like the model to accurately
reflect what we already know, we impose the constraints $ p_k = p'_k $ for the
outcomes we have observed. The maximum entropy model is the model that satisfies
these constraints while maximizing $ H(p) $.

This model encodes what is known
while remaining maximally noncommittal on what is unknown.

Adam Berger (CMU) provides a more concrete example.
If you're interested in learning more, his tutorial is highly recommended
reading.

Why? #

In a self-tracking context, you might ask the following questions:

  1. Which of my habits most strongly predict whether I'll have a panic attack?
  2. Given what I did on a particular day, can I predict whether panic is likely?

Maximum entropy modelling can help address these questions. It is often
used to classify unseen examples, and would be fantastic in a
data commons scenario
with enough data to provide recommendations to users.

Feature Extraction #

Since I'm now effectively building a classifier, there's an additional step.
I need features for my classifier, which I extract from my existing datasets:

# W, F, and P map dates to wellness, food, and panic journal entries
# respectively (loaded elsewhere in the full script).
train_set = []
# Only use dates that appear in both the wellness and food logs.
dates = set(W).intersection(F)
for ds in dates:
  try:
    # Binarize each measurement; my daily goals serve as the thresholds.
    ds_data = {
      'relaxation' : bool(int(W[ds]['relaxation'])),
      'exercise' : bool(int(W[ds]['exercise'])),
      'caffeine' : int(F[ds]['caffeine']) > 0,
      'sweets' : int(F[ds]['sweets']) > 1,
      'alcohol' : int(F[ds]['alcohol']) > 4,
      'supplements' : bool(int(F[ds]['supplements']))
    }
  except (ValueError, KeyError):
    # Skip days with missing or malformed entries.
    continue
  had_panic = 'panic' if P.get(ds) else 'no-panic'
  train_set.append((ds_data, had_panic))

Note that the features listed here are binary. I use my daily goals as
thresholds on caffeine, sweets, and alcohol.

(If you know how to get float-valued features working with NLTK, let me know!
Otherwise, there's always megam or YASMET.)

The Data #

You can see the code for this analysis here.
This time I don't care about having consecutive dates, so I use all of the
samples.
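
For reference, building the classifier takes only a couple of lines with NLTK.
This is a minimal sketch that assumes the train_set built above; the example
day at the end is made up:

from nltk.classify import MaxentClassifier

# Train a maximum entropy classifier on the (features, label) pairs.
classifier = MaxentClassifier.train(train_set)

# Classify a hypothetical new day.
print(classifier.classify({
  'relaxation': True, 'exercise': True, 'caffeine': False,
  'sweets': False, 'alcohol': False, 'supplements': True
}))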

After building a MaxentClassifier, I print out the most informative features
with show_most_informative_features():

  -2.204 exercise==True and label is 'panic'
   1.821 caffeine==True and label is 'panic'
  -0.867 relaxation==True and label is 'panic'
   0.741 alcohol==True and label is 'panic'
  -0.615 caffeine==True and label is 'no-panic'
  -0.537 supplements==True and label is 'panic'
   0.439 sweets==True and label is 'panic'
   0.430 exercise==True and label is 'no-panic'
   0.284 relaxation==True and label is 'no-panic'
   0.233 supplements==True and label is 'no-panic'

According to these weights, exercise, relaxation breathing, and vitamin
supplements help with panic, while caffeine, alcohol, and sweets do not. I knew
that already, but this suggests which treatments or dietary factors have the
greatest impact.

Let's consider the supplements finding more closely. Of the 45 days, I took
supplements on all but two. It's dangerous to draw any conclusions from a
feature for which there are very few negative samples.
This points to an important lesson about data analysis: before trusting any
individual feature weight, check how your samples are distributed across that
feature.

Up Next #

In my next post, I look at a panic recovery dataset gathered using
qs-counters, a simple utility I built to reduce friction in
self-tracking. I perform these same three analyses on the
qs-counters dataset, then compare it to the
recovery-journal dataset.