In the last post, we thought up some ideas for how to do timing synchronization. We note that this is not exactly how GSM works – with GSM you have special bursts that are used for timing synchronization, and the receiver uses those to keep its clock in sync with the transmitter. Since we’re not actually implementing a GSM receiver (how many GSM networks remain?), we’re going to look at the generic case of doing timing synchronization with a known training sequence in the presence of severe multipath (mostly for training purposes: I think this stuff is neat and I want to get better at it).

We try out various correlations and see what happens (literally, we’re eyeballing stuff here):

```
tiledlayout(2,2)
nexttile
[c, lagz] = xcorr(received, modulated_training_sequence(1:end)); plot(abs(c(200:end)))
title("correlation with full training sequence")
nexttile
[c, lagz] = xcorr(received, modulated_training_sequence(9:end)); plot(abs(c(200:end)))
title("correlation with TS(9:end)")
nexttile
[c, lagz] = xcorr(received, modulated_training_sequence(1:end-8)); plot(abs(c(200:end)))
title("correlation with TS(1:end-8)")
nexttile
[c, lagz] = xcorr(received, modulated_training_sequence(9:end-8)); plot(abs(c(200:end)))
title("correlation with TS(9:end-8)")
copygraphics(gcf)
```

and we get the following plot. Note that the vertical scale is different for each of the subplots.

OK, so as we kinda suspected in the previous post, using a longer correlation template leads to less sidelobe amplitude without obvious widening of the true correlation peak.

Now that we’ve decided to use the full training sequence for doing correlation in time, let’s write out an estimator. First of all, we correlate the received signal with the training sequence:

```
correlation_output = conv(signal, conj(flip(modulated_training_sequence)));
[val, uncorrected_offset] = max(correlation_output);
```

We use the same channel (otherwise the shape of the correlation peak will be different across runs), and we run it a bunch of times with the same training sequence and random data:

```
% signal creation
training_sequence = randi([0 1], 32,1); %[0,1,0,0,0,1,1,1,1,0,1,1,0,1,0,0,0,1,0,0,0,1,1,1,1,0]';
% channel creation
nominal_sample_rate = 1e6 * (13/48);
signal_channel = stdchan("gsmEQx6", nominal_sample_rate, 0);
modulated_training_sequence = minimal_modulation(training_sequence);
average_convolution = zeros(319,1);
average_convolution_energy = zeros(319,1);
for i = 1:1024
data = [randi([0 1],128,1); training_sequence; randi([0 1], 128,1)];
modulated = minimal_modulation(data);
signal_channel(complex(zeros(30,1))); % flush the channel's delay line between runs
received = signal_channel(modulated);
awgned = awgn(received,80);
first_convolution = conv(awgned, conj(flip(modulated_training_sequence)));
average_convolution_energy = average_convolution_energy + abs(first_convolution);
average_convolution = average_convolution + first_convolution;
end
figure;
plot(abs(average_convolution));
title("average convolution output");
figure;
plot(abs(average_convolution_energy));
title("average convolution energy output");
function not_really_filtered = minimal_modulation(data)
not_really_filtered = pammod(data,2);
end
```

Output from a single run looks like this:

And on average, we get something like this: the first image is the average of the outputs, the second image is the average of the *energy* of the outputs:

We observe two things:

- There isn’t a single sharp correlation peak – it has lots of structure, even when averaged over 1024 runs. Even more disquieting, the structure seems *constant* across runs on the same channel, even though the data is random.
- For a single run, there’s a lot of sidelobe energy. It gets averaged away over many runs, but our estimator needs to work on a single run only.

In an actual system, we almost always have additional information about our signal:

- when the signal started (energy detector)
- when we expect it to start (local timing reference)
- where the training sequence lives in the signal (hopefully the implementer has read the standard)

Even if this information is somewhat coarse/inaccurate, it lets us remove some of the irrelevant bits (or rather, samples,) of the received signal before feeding it to the timing estimator, which will reduce how many sidelobes appear.

If the sidelobes are sufficiently far away from the main peak such that we can ignore them without too much additional information about signal structure, **what truly matters is the shape and position of the intended correlation peak, and what’s immediately around it**.

Even if we avoid the sidelobes, it’s unclear *which part* of the correlation peak we should use as our timing estimate. The highest peak? The first peak above a significant threshold? It’s not unambiguous.

Training/synchronization sequences are generally selected to have negligible autocorrelation at non-zero delays – to improve, well, correlating against them. This means that when we correlate the received signal against the training sequence, we’ll get something that looks like the channel impulse response estimate – and the “sharper” the autocorrelation of the training sequence, the better that channel impulse response estimate will be.
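To make that concrete, here’s a small Python/NumPy sketch (standing in for the MATLAB elsewhere in this post; the Barker-13 code and the channel taps are my own toy choices, not anything from GSM):

```python
import numpy as np

# A Barker-13 code stands in for the training sequence (near-ideal
# autocorrelation: peak 13, sidelobe magnitudes at most 1); the channel
# taps are made up.
ts = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)
h = np.array([0.2, 1.0, -0.5, 0.3])  # toy channel impulse response

received = np.convolve(ts, h)
# Correlating = convolving with the conjugated, time-reversed sequence.
corr = np.convolve(received, np.conj(ts[::-1]))

# Around the main lobe, corr[len(ts)-1 + k] is roughly len(ts) * h[k]:
# the correlation output is a (scaled, sidelobe-polluted) CIR estimate.
h_hat = corr[len(ts) - 1 : len(ts) - 1 + len(h)] / len(ts)
print(np.round(h_hat, 2))
```

The recovered taps land within the sidelobe level of the true ones; a sharper-autocorrelation (or longer) sequence shrinks that error.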

If you think of the channel as a tapped delay line, then the **peak** of the correlation output corresponds to the “tap” (delay) with the highest magnitude. This doesn’t necessarily correspond to the first/“earliest” tap with a significant coefficient, nor does it generally correspond to the best timing for the least-squares channel estimation.

Naturally, if we only have a single big tap in our channel, this is probably fine: as long as the single big tap is in the window for the least-squares, we’re home free. However, if we have multiple significant taps – especially if they aren’t all bunched together – and want to do a good job, things get more complicated.

We need a better way to process this hedgehog-looking correlation peak.

Morally, to minimize bit error, we want to find the channel taps that give us the most energy. The more energy, the less the noise can perturb your decisions – that’s the extreme tl;dr of the whole Shannon thing. You don’t care about super-attenuated paths (their contributions are barely distinguishable from noise), and if you have a single path that’s much stronger than the others, you can ignore the others and pretend you don’t have a dispersive channel at all!

With Viterbi detection it’s a tad nontrivial to reason about it, but imagine a rake receiver: if you want to get the most **signal energy** into your decision device, you need to identify the **channel taps** with the most energy. The rake receiver can pick out a finite but arbitrary set of paths, but with Viterbi, we need to decide on a *window* of the channel to use – everything within that window gets used, nothing outside gets used. That window tends to be fairly small: trellis-based detection is a controlled combinatorial explosion.

Since that window is precious, we need to cram as much energy into it as possible, and **this** is precisely why the timing estimator matters here. If our timing is suboptimal, we’re dropping valuable signal energy on the floor. This inspires a timing estimator design: run a window of appropriate size over the coarse channel impulse response (which is generated by the correlation we previously looked at) and pick the offset with the highest **total energy in that window**.
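A minimal sketch of that estimator in Python/NumPy (the Barker-13 sequence and the two-significant-tap channel are made-up stand-ins for illustration):

```python
import numpy as np

def window_energy_timing(received, mod_ts, L=8):
    # First convolution: correlate against the training sequence to get
    # a coarse channel impulse response estimate.
    first = np.convolve(received, np.conj(mod_ts[::-1]))
    # Second convolution: total magnitude in every L-long window.
    second = np.convolve(np.abs(first), np.ones(L))
    return int(np.argmax(second))

# Toy check: a Barker-13 "training sequence" sent through a channel with
# two significant, separated taps.
ts = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)
h = np.array([1.0, 0.0, 0.7])
received = np.convolve(ts, h)
offset = window_energy_timing(received, ts, L=8)
# The winning window is one that covers *both* significant taps, not
# just the single highest correlation sample.
```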

Note that with a more usual non-ISI case, timing matters because we want to sample the signal at the best time, otherwise the eye diagram closes and we get more bit errors. In this receiver architecture, channel estimation + Viterbi take care of that consideration: the fine timing estimate is, in effect, “baked into” the channel estimate.

To find the size of the window, we ask whoever designed our channel estimator / trellis detector what’s the biggest channel they can handle, which, as usual, we call \(L\). Here, we’ve decided \(L=8\). Mathematically, we’re taking the absolute value of the first convolution, and convolving that with an \(L\)-long vector \([1,\cdots,1]\):

```
first_convolution = conv(signal, conj(flip(template)));
second_convolution = conv(abs(first_convolution), ones(1,8));
[val, uncorrected_offset] = max(second_convolution);
```

Note the `abs(first_convolution)`. This is *critical*, and omitting it caused me a lot of sadness and confusion. We want the total *energy* in that window, and if there’s cancellation across channel coefficients/taps then we’re…not getting a total energy.
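A tiny illustration of the cancellation (Python, with a made-up two-tap coarse CIR):

```python
import numpy as np

# Made-up coarse CIR: two equal-magnitude taps of opposite sign.
cir = np.array([1.0, -1.0])
signed_window = np.convolve(cir, np.ones(2))          # no abs: taps cancel
energy_window = np.convolve(np.abs(cir), np.ones(2))  # abs first: energy kept
print(signed_window[1], energy_window[1])  # -> 0.0 2.0
```

The window covering both taps sums to zero without the `abs`, so the estimator would see no energy exactly where the channel has the most.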

If we change our code:

```
average_second_convolution = zeros(326,1);
% ...
second_convolution = conv(abs(first_convolution), ones(1,8));
average_second_convolution = average_second_convolution + second_convolution;
% ...
plot(abs(average_second_convolution));
title("average second convolution output");
```

We get something that looks remarkably better. While a single run looks like this:

This looks pretty bad sidelobe-wise, but sidelobes don’t doom us.

When we average over multiple runs (which effectively “averages out” the sidelobes), we see that our new correlation erased the structure in our correlation peak. Note that it’s **not** the averaging over multiple runs that’s done this!

Sometimes a little structure does show up, but it’s not nearly as bad as before:

We try it with a different channel (`gsmTUx12c1`) to make sure that it’s not a peculiarity of the channel model we had been using:

Convolving once against the training sequence, calculating the energy, then convolving again with a largest-supported-channel-length vector of ones certainly generates extremely aesthetic plots, but we’ve yet to make sure that it actually matches up against the ground truth.

Without any subtlety, we simply generate a loss vector (how much least-square loss for each offset) for each run, sum them all up, and plot them alongside the second convolution output:

```
average_ts_losses = zeros(1,264);
% ...
ts_losses = least_squares_offset(awgned, training_sequence);
average_ts_losses = average_ts_losses + ts_losses;
% ...
average_second_convolution = average_second_convolution / norm(average_second_convolution);
average_ts_losses = average_ts_losses / norm(average_ts_losses);
figure;
plot(abs(average_second_convolution));
hold on;
plot(average_ts_losses);
title("peak = second convolution, dip = least squares loss");
```

And the comparison is *incredibly* encouraging. Note that these are averages, so the sidelobes from the data get averaged away. We see that the convolution-based estimator has more sidelobes, but the main peak is sharp:

To show it’s not a fluke, we show a comparison for a single run:

There’s still the eternal question of the indices, which decidedly do not line up. We investigate:

```
correlation_indices = [];
least_squares_indices = [];
% ...
[val, correlation_index] = max(second_convolution);
correlation_indices = [correlation_indices correlation_index];
[val, least_squares_index] = min(ts_losses);
least_squares_indices = [least_squares_indices least_squares_index];
% ...
```

And in the command window we run:

```
>> sum(least_squares_indices-correlation_indices)/length(least_squares_indices)
ans =
-29.9922
>>
```

The almost-integral offset might seem like a fluke, but we run it a few more times and see that it’s not – it does vary a little bit:

```
>> clear; average_convolution_output
ans =
-33.7109
>> clear; average_convolution_output
ans =
-29.7305
>> clear; average_convolution_output
ans =
-27.7578
>> clear; average_convolution_output
ans =
-30.9141
>>
```

The \(\sim30\)-ness was a bit concerning, since the GSM training sequence is \(26\) syms long and our channel/window is \(8\) long and there’s no obvious and morally-upstanding way to get \(\sim30\) out of that, but looking at our source code we see that we indeed did choose a \(32\)-long training sequence (the `%` is the comment character in MATLAB):

`training_sequence = randi([0 1], 32,1); %[0,1,0,0,0,1,1,1,1,0,1,1,0,1,0,0,0,1,0,0,0,1,1,1,1,0]';`

We seem to have a little error between the least-squares timing estimator (which we’re using as a reference) and our two-correlation-based estimator, which is a bit concerning. We try to figure out what’s going on, and we start off simple:

```
>> min(least_squares_indices)-max(least_squares_indices)
ans =
0
>> least_squares_indices(1)
ans =
142
>>
```

We notice that the least-squares estimator always gives the same index (\(142\)), so whatever is going on, it’s in the correlation-based estimator, and uh, there’s definitely something going on:

```
>> hold on
>> plot(least_squares_indices)
>> plot(correlation_indices)
```

That outlier is what’s skewing the average! While our code didn’t save the raw data for each run (only the outputs of the estimators), it’s clear what happened: most of the time, this estimator doesn’t get tricked by the sidelobes, but when it does, it gets tricked *hard*.

We plot a histogram of the indices:

`>> histogram(correlation_indices, 'BinMethod', 'integers')`

The most common (by far) value is \(174\), which indeed is \(142+32\):

```
>> 174-32
ans =
142
>> least_squares_indices(1) % remember least_squares_indices(n)=142 for all n
ans =
142
>>
```

**We have not tested this estimator in the presence of noise**. Not to sound like an excerpt from a statistical signal processing textbook, but it is essential to test how well your estimators perform in the presence of noise. Graphs with \(E_{b}/N_{0}\) on the x-axis are, strictly speaking, optional, but they *do* look very nice.

We won’t do a full examination of how these estimators work in noise, but we’ll take a quick look.

I ran the same code, except modified with `awgned = awgn(received,4);`. This adds AWGN such that the signal-to-noise ratio is \(4\text{ dB}\). From cursory inspection of the plot below, we see that both estimators (the correlation-based one more so than the LS one) are more likely to be tricked by sidelobes in higher-noise conditions. Even if we ignore the sidelobe-caused indices, we see some variation/wobble and not a straight line. This represents error which won’t be eliminated by running the estimators on a smaller section of the signal.

Here’s the histograms for the two indices. We definitely see that the huge errors are from sidelobes and not from the correlation peak spreading out because of noise:

If we zoom in to eliminate the sidelobes, we see that the correlation peaks indeed spread out, but approximately the same amount for both estimators:

The least-squares timing estimator we had been using as a reference is excellent, but it’s *incredibly* compute-intensive. Here, we derived and tested an alternate correlation-based estimator which only requires a convolution, an absolute-value operation, and a second convolution – and the second convolution doesn’t even require any multiplies!

To reduce computational effort and avoid getting tricked by sidelobes, if at all practical, we should use a priori information (a fancy way of saying “when we started receiving it” and/or “when we expect to receive it” alongside “where the training sequence lives in the signal”) about the signal to slice a section of the signal and only run the estimator on that section.

While previously we had wanted to concoct a measure for the “goodness” of a timing estimator that’d make sense for this context (trellis-based detection in an ISI channel), we’ll be looking at Viterbi itself in the next few posts.

Why try and fake it when we’ll learn and use the real thing?

In the last post, we ran a least-squares on *every possible time offset*, calculated the loss, and declared the optimal timing offset to be the one with the lowest loss. This approach is incredibly inefficient, but critically, we know it handles the dispersive channel correctly. In this and the next post, we’ll use it as a gold standard to validate a more efficient approach.

Yep. It’s a correlation. Our intuition indeed tells us to correlate the received signal against the modulated training sequence, and look for the correlation peak. If we *didn’t* have a dispersive channel, the story ends here.

The dispersive channel makes things a bit more subtle! Remember, what’s *received* is not going to look like what’s transmitted, so we have to be careful in our analysis. When we ran the least-squares, we were careful to run it only on the slice of *received* training sequence that was *unaffected* by unknown data. It’s possible we might have to do something similar here – use a subset of the training sequence, and not the whole thing.

If the channel’s delay spread is \(L\) symbol intervals long, we have a couple reasonable choices for the correlation “template”:

- the full training sequence
- the training sequence with \(L\) symbols removed from the beginning
- the training sequence with \(L\) symbols removed from the end
- the training sequence with \(L\) symbols removed from both ends

I learned enough SAGE to calculate the symbolic expressions for the first two cases (with a channel length of 4, a training sequence length of 10, and 10 symbols before and after the training sequence) to see if there was some insight attainable by looking at the output:

```
sage: prepend_data = list(var('XXXXXXXXXX_%d' % (i)) for i in range(10))
sage: TS = list(var('TS_%d' % (i+10)) for i in range(10))
sage: append_data = list(var('XXXXXXXXXX_%d' % (i+20)) for i in range(10))
sage: burst = prepend_data + TS + append_data
sage: chan = list(var('CHAN_%d' % (i+1)) for i in range(4))
sage: received = convolution(burst, chan)
sage: convolution(received, list(reversed(TS)))
sage: convolution(received, list(reversed(TS[4:10])))
```

We can see, for instance, that correlating with the full training sequence has zero outputs unaffected by unknown data (run it yourself if you want to check, I’m not including the output here), but correlating with `TS[4:10]` – removing the first 4 symbols from the training sequence – has two outputs unaffected by unknown data:

```
(CHAN_4*TS_10 + CHAN_3*TS_11 + CHAN_2*TS_12 + CHAN_1*TS_13)*TS_14 + (CHAN_4*TS_11 + CHAN_3*TS_12 + CHAN_2*TS_13 + CHAN_1*TS_14)*TS_15 + (CHAN_4*TS_12 + CHAN_3*TS_13 + CHAN_2*TS_14 + CHAN_1*TS_15)*TS_16 + (CHAN_4*TS_13 + CHAN_3*TS_14 + CHAN_2*TS_15 + CHAN_1*TS_16)*TS_17 + (CHAN_4*TS_14 + CHAN_3*TS_15 + CHAN_2*TS_16 + CHAN_1*TS_17)*TS_18 + (CHAN_4*TS_15 + CHAN_3*TS_16 + CHAN_2*TS_17 + CHAN_1*TS_18)*TS_19,
(CHAN_4*TS_11 + CHAN_3*TS_12 + CHAN_2*TS_13 + CHAN_1*TS_14)*TS_14 + (CHAN_4*TS_12 + CHAN_3*TS_13 + CHAN_2*TS_14 + CHAN_1*TS_15)*TS_15 + (CHAN_4*TS_13 + CHAN_3*TS_14 + CHAN_2*TS_15 + CHAN_1*TS_16)*TS_16 + (CHAN_4*TS_14 + CHAN_3*TS_15 + CHAN_2*TS_16 + CHAN_1*TS_17)*TS_17 + (CHAN_4*TS_15 + CHAN_3*TS_16 + CHAN_2*TS_17 + CHAN_1*TS_18)*TS_18 + (CHAN_4*TS_16 + CHAN_3*TS_17 + CHAN_2*TS_18 + CHAN_1*TS_19)*TS_19
```

This is a curious phenomenon: chopping off symbols from the training sequence – causing the correlation template to have **fewer** symbols – causes *more* output values that don’t depend on unknown data. This makes total sense, since if you have a tiny little template then it’ll have more “alignments” in the un-tainted section of the received signal.
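We can also sanity-check this numerically instead of symbolically: run many trials with fresh random data and count which correlation outputs never change. Here’s a Python/NumPy sketch with the same dimensions as the SAGE experiment (channel length 4, training sequence length 10, 10 random symbols on each side; the random channel and BPSK mapping are my own stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 10, 4                      # training sequence length, channel length
ts = 2.0 * rng.integers(0, 2, N) - 1
h = rng.normal(size=L)            # made-up random channel taps

def correlate_once(template):
    # Fresh random data on both sides of the training sequence each call.
    data = lambda n: 2.0 * rng.integers(0, 2, n) - 1
    burst = np.concatenate([data(10), ts, data(10)])
    received = np.convolve(burst, h)
    return np.convolve(received, template[::-1])

def count_constant_outputs(template, trials=50):
    # Outputs that are identical across all trials depend only on the
    # training sequence and channel, not on the unknown data.
    runs = np.stack([correlate_once(template) for _ in range(trials)])
    return int(np.sum(np.ptp(runs, axis=0) == 0))

full = count_constant_outputs(ts)        # full training sequence
chopped = count_constant_outputs(ts[L:]) # first L symbols removed
print(full, chopped)  # -> 0 2
```

Same answer as the symbolic run: zero data-free outputs for the full template, two for the chopped one.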

Unfortunately, this goes in the opposite direction of traditional wisdom about correlation: use the *largest* template you can. So we’re left to wonder, is there a “happy medium” between too much contribution from unknown data, and too few symbols in the correlation template?

There is a saving grace: the standard assumption (and in fact, **standard^{1} practice!**) is that modulators are fed data that is statistically indistinguishable^{2} from random, so the unknown data’s contributions to the correlation tend to average out.

There’s only so much that thinking about it can do. It seems like we’re going to have to figure out some ways to judge our estimators, and do some simulations.

Let’s think of some criteria we can use to evaluate our estimators:

The estimator gives us an estimated timing offset, and the closer it is to the “true” (determined by least-squares offset sweep) timing offset, the happier we are.

For each trial, we will compute an estimated timing offset, a true timing offset, and an error – and we can accumulate the error over multiple trials (for instance, by calculating a mean squared error). We can then generate a graph of how the error changes as a function of signal-to-noise ratios, since it’s possible that noise affects each estimator differently.

The purpose of most^{3} receivers is to ingest RF (or baseband) and output the best possible estimate of the bits the transmitter ingested, and it’s unclear how these timing errors affect downstream signal processing.

If this were a receiver for a nondispersive channel, we’d make the argument that symbol timing error straightforwardly translates into symbol decisions happening at the wrong times. This leads to the RRC condition being violated, causing symbol errors from ISI, and presumably more influence from noise, since we’re not capturing the signal at its peak. We could look at the pulse shape and filter responses and try to eyeball how much timing error affects bit error rate.

However, we fully intend to tackle the nastiest of dispersive channels, and therefore what lies downstream of the synchronization blocks is a channel estimator and a trellis detector. Timing errors will affect both of these in more complicated ways.

We can try and handwave and say that small enough timing errors will be compensated for by the channel estimator and so we shouldn’t worry much, but I am interested in trying to see if we can find a somewhat reasonable way to quantify timing errors.

We haven’t got to trellis detection yet, and there are lots of subtleties and design choices which I don’t yet^{4} understand, but from what I know, the high level operating principle of trellis detection looks like this:

- we do not try and find a sufficiently-magical^{5} filter that lets us “undo” the effect of the channel and feed it into a normal decision device
- instead, we determine which symbols were sent by seeing how well the received signal matches **what we would expect to receive** for various transmitted symbol sequences
- we do this with a local modulator^{6} that generates a “template” transmitted signal **before the channel**, for any given symbol sequence
- the channel estimator told us what the channel looks like, so we can convolve the “template” signal against the channel to see what we’d expect **to have received** – if the transmitter had sent that sequence of symbols
- we choose the sequence of symbols that matches best

The error in the demodulation process is probably going to be vaguely of the form (with \(\ast\) convolution):

\[(\text{hypothetical transmitted signal}) \ast (\text{estimated channel}) - (\text{actual received signal})\]

Morally we should want to minimize this error, and we can compare how good our timing estimator is by comparing it to an “ideal” timing estimator, with something that looks like this:

- Run the incredibly slow least-squares estimator over all possible offsets (ok, we can cheat and cue it to where we know the training sequence lives), and obtain an estimated channel with the lowest possible loss.
- Calculate the magnitude of \[(\text{original transmitted signal}) \ast (\text{estimated channel with best offset}) - (\text{actual received signal})\] over the *whole signal*.
- Run the timing estimator, obtain a timing offset.
- Run a least-squares at the estimated timing offset, obtain an estimated channel.
- Calculate the magnitude of \[(\text{original transmitted signal}) \ast (\text{estimated channel}) - (\text{actual received signal})\]
- The error of the timing estimator is the difference between the two magnitudes.

I haven’t written code for this yet! But it seems reasonable?

Is it narrow/wide? Are there multiple sub-peaks? If there’s a single narrow peak (like when we slid the least-squares along the signal), then everything is happy; if the peak has multiple sub-peaks or is wider, then we need to be more careful about what’s going on, and figure out how to go from a vector of correlation outputs to a single timing offset.

How high are the other peaks? If they’re high, the likelihood of the estimator choosing the wrong peak increases. To be clear, it’s often reasonable to accept higher sidelobe levels as a tradeoff for a narrower true peak / fewer errors, but we should be sure that our timing detector won’t accidentally lock onto the wrong peak. Common ways of ensuring this look like “having a local clock telling us approximately when we’re expecting to receive a burst” (like actual GSM receivers do) or an energy detector that tells us when a burst is starting.

I’m going to write some code to implement a correlation-based timing estimator (parametrizable with how much of the training sequence we’re using as the “template”), along with code to test it against the “ideal” timing estimator.

And most critically, there will be plenty of graphs – of correlation peaks, sidelobes, and most critically, graphs with \(E_{b}/N_{0}\) on the x-axis!

If the transmitted data is trivially distinguishable from random, then the transmitter pays Shannon for energy (joules for feeding the power amplifier) and bandwidth (how many MHz we splatter our signal over) it doesn’t need. Motivating example without calculations: you are sending a hundred bits of data, with each of those bits generated by a random process with \(p(0) = 0.99\) and \(p(1) = 0.01\). This bitstring is quite distinguishable from random (even a low-pass filter can distinguish it). We can transmit the *same information* in the bitstring with much less energy and bandwidth by compressing it: run-length encoding converts it into a much smaller bitstring. The compressed bitstring will look a lot closer to random data (and if there’s still obvious ways it differs, then we can compress it further :).↩︎

Only a certain amount of statistical indistinguishability. We don’t need full statistical indistinguishability here (error correction schemes necessarily introduce deterministic relationships between transmitted bits) and certainly not something like cryptographic indistinguishability.↩︎

There are, in fact, receivers where accurate channel and/or timing estimation is the primary goal, and the data is of secondary importance. For instance, a GPS receiver designer is mighty concerned about getting timing *incredibly* right, and an air-defense radar designer is in the business of estimating a channel – where one of the taps might be an enemy aircraft.↩︎

Ungerboeck vs Forney observation models is the most salient, but there’s lots of other stuff.↩︎

The “sufficiently-magical filter” approach in fact works for well-behaved dispersive channels! Zero-forcing equalization (use a filter that’s the inverse of the channel) can lead to suffering really quick since if you have a null in your channel, you’ll have infinite noise amplification in your equalizer, which is bad. MMSE equalization strikes a balance between noise enhancement and ISI suppression based on the SNR, but with GSM channels we can’t get away with only MMSE. There are ways to design prefilters without introducing much noise to transmogrify the channel’s impulse response to have more of its energy towards the beginning, and this can help make the trellis detector less compute-intensive. Channel-shortening will be a different story, and one we will look at in a later post!↩︎

Usually just lookup tables, sorry to ruin the mystique.↩︎

So now that we have a good enough channel estimation mechanism, we need to figure out how to apply it to something that vaguely looks like a real-world problem. This means no hard coding indices! We’re not going to get away with that in the real world, unless maybe we’ve got a sync cable between the transmitter and receiver…

This means we need to handle a few things:

- Timing offsets: we don’t know exactly when the training sequence starts
- Frequency offsets: local oscillators aren’t perfect
- Phase offsets: transmitter and receiver local oscillators aren’t perfectly in phase

Phase offsets are the easiest to handle, since those get “baked into” the channel estimate. Adding a phase offset \(\phi_{1}\) before the channel and a phase offset \(\phi_{2}\) after the channel is equivalent to multiplying the channel estimate by a complex number with magnitude 1 and phase \(\phi_{1}+\phi_{2}\). With \(\ast\) denoting convolution:

\[((e^{j\phi_{1}} \cdot \text{transmitted}) \ast \text{channel}) \cdot e^{j\phi_{2}} = \text{transmitted} \ast (e^{j(\phi_{1} + \phi_{2})} \cdot \text{channel})\]

So we don’t even need to estimate a phase offset – the channel estimation handles it.
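This identity is easy to verify numerically; a quick Python/NumPy check (the random “transmitted” and “channel” vectors and the phase values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=16) + 1j * rng.normal(size=16)  # stand-in "transmitted"
h = rng.normal(size=4) + 1j * rng.normal(size=4)    # stand-in "channel"
p1, p2 = np.exp(1j * 0.3), np.exp(1j * 1.1)         # arbitrary phase offsets

# Rotating before and after the channel == rotating the channel itself
# by the sum of the two phases.
lhs = np.convolve(p1 * x, h) * p2
rhs = np.convolve(x, np.exp(1j * (0.3 + 1.1)) * h)
print(np.allclose(lhs, rhs))  # -> True
```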

We handle phase noise with the time-honored tradition of ignoring it. Seriously though, I’m not sure how much it’s a problem. I think if we have a way to update the channel estimate as it’s being used to estimate bits (“per survivor processing”), we should be able to handle it, but I’m nowhere near that yet.

A frequency offset will show up as a changing phase offset, and this won’t be handled by the channel estimation – unless, again, we have a way to update the channel estimate as it’s being used in the actual demodulation.

Fortunately, we can estimate and compensate for frequency offsets earlier in the receiver and without needing the ability to estimate/update the channel estimate. In fact, we likely can get better results by compensating for frequency error with a mechanism designed for that purpose, that can operate over the entire received burst at once.

In actual GSM, coarse frequency synchronization is handled by the “frequency burst”, which carries no data but is designed to allow easy recovery of the carrier frequency at the mobile station. We’ll look at how to handle frequency offsets in a future post.

Similarly, in actual GSM, coarse time synchronization is handled by the “synchronization burst” – which has an extra-long training sequence and information that identifies the base station.

We will look at how to handle time offsets in this post. The synchronization burst is a special case of a normal burst, and we can use the same methods to handle both.

I spent some time guessing various indices for the least squares and eyeballing “how good” the channel estimate was (with a channel generated by `conv(modulated, [1,2,3,4,5,4,3,2]);`), which was kind of elucidating but not very principled.

Fortunately, we have a better tool: the least squares loss function.

We stop using the hardcoded channel `[1,2,3,4,5,4,3,2]` with its mysteriously-integer-valued coefficients, and use `interference_channel = stdchan("gsmEQx6", nominal_sample_rate, 0);` instead, and we add some AWGN: `awgned = awgn(received,10);`. Here’s the code:

```
training_sequence = [0,1,0,0,0,1,1,1,1,0,1,1,0,1,0,0,0,1,0,0,0,1,1,1,1,0]';
data = [randi([0 1],64,1); training_sequence; randi([0 1], 128,1)];
modulated = minimal_modulation(data);
nominal_sample_rate = 1e6 * (13/48);
interference_channel = stdchan("gsmEQx6", nominal_sample_rate, 0);
%received = conv(modulated, [1,2,3,4,5,4,3,2]);
%received = conv(modulated, [1,1,0,0,0,0,0,0]);
received = interference_channel(modulated);
nominal_channel_length = 8
modulated_training_sequence = minimal_modulation(training_sequence);
training_sequence_length = length(training_sequence)
toeplitz_column = modulated_training_sequence(nominal_channel_length:training_sequence_length);
toeplitz_row = flip(modulated_training_sequence(1:nominal_channel_length));
T = toeplitz(toeplitz_column, toeplitz_row);
awgned = awgn(received,10);
clean_part_of_training_sequence = training_sequence_length - nominal_channel_length;
TS_losses = ones(1, length(received)-clean_part_of_training_sequence);
for offset = 1:(length(received)-clean_part_of_training_sequence)
interesting_part_of_received_signal = awgned(offset:offset+clean_part_of_training_sequence);
estimated_chan = lsqminnorm(T, interesting_part_of_received_signal);
loss_vector = T*estimated_chan - interesting_part_of_received_signal;
TS_losses(offset) = norm(loss_vector);
end
[val, best_offset] = min(TS_losses);
interesting_part_of_received_signal = awgned(best_offset:best_offset+clean_part_of_training_sequence);
best_estimated_chan = lsqminnorm(T, interesting_part_of_received_signal)
function not_really_filtered = minimal_modulation(data)
not_really_filtered = pammod(data,2);
end
```

The relevant section for what we’re doing today is below. We try every possible offset, and we see which one gives us the smallest loss:

```
TS_losses = ones(1, length(received)-clean_part_of_training_sequence);
for offset = 1:(length(received)-clean_part_of_training_sequence)
    interesting_part_of_received_signal = awgned(offset:offset+clean_part_of_training_sequence);
    estimated_chan = lsqminnorm(T, interesting_part_of_received_signal);
    loss_vector = T*estimated_chan - interesting_part_of_received_signal;
    TS_losses(offset) = norm(loss_vector);
end
[val, best_offset] = min(TS_losses);
interesting_part_of_received_signal = awgned(offset:offset+clean_part_of_training_sequence);
best_estimated_chan = lsqminnorm(T, interesting_part_of_received_signal)
```

Yeah, we redo the least squares calculation, but it’s not a big deal. This is a terribly slow way to do timing synchronization anyway, but it’s excellent to figure out what is going on.
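For readers without MATLAB handy, the same brute-force search can be sketched in NumPy. This is a toy reconstruction, not the post’s exact code: the channel is the contrived `[1,2,3,4,5,4,3,2]` with no noise, `np.linalg.lstsq` stands in for `lsqminnorm`, and the Toeplitz matrix is built with `sliding_window_view`:

```python
import numpy as np

rng = np.random.default_rng(0)

# the post's 26-bit training sequence, BPSK-modulated (sign convention is arbitrary)
bits = np.array([0,1,0,0,0,1,1,1,1,0,1,1,0,1,0,0,0,1,0,0,0,1,1,1,1,0])
ts = 2.0 * bits - 1.0
L = 8                                              # assumed channel length

tx = np.concatenate([2.0 * rng.integers(0, 2, 64) - 1.0,   # 64 random data symbols
                     ts,
                     2.0 * rng.integers(0, 2, 128) - 1.0]) # 128 random data symbols
chan = np.array([1, 2, 3, 4, 5, 4, 3, 2], float)
rx = np.convolve(tx, chan)                         # noiseless, for clarity

# Toeplitz system over the part of the TS unaffected by unknown data:
# row i is [t_{8+i}, ..., t_{1+i}]; 19 rows x 8 columns
T = np.lib.stride_tricks.sliding_window_view(ts, L)[:, ::-1]
n = T.shape[0]

losses = np.empty(len(rx) - n + 1)
for off in range(len(losses)):
    seg = rx[off:off + n]
    est, *_ = np.linalg.lstsq(T, seg, rcond=None)
    losses[off] = np.linalg.norm(T @ est - seg)

best = int(np.argmin(losses))                      # 64 data symbols + 8 - 1 = 71
best_est, *_ = np.linalg.lstsq(T, rx[best:best + n], rcond=None)
print(best, np.round(best_est, 6))
```

With a noiseless channel the loss hits (numerical) zero at exactly one offset, and the least squares there returns the channel taps exactly.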

We `plot(TS_losses)` and get an unambiguous and deep correlation dip:

Looking at how we calculated `TS_losses`, we see that it’s *only* calculated over the training sequence, not over the whole burst. We are curious to see what happens if we calculate the “error” over the whole burst. We’ll do this by convolving the *transmitted* signal (before the channel) with the estimated channel, and then subtracting the received signal:

```
>> tiledlayout(2,1)
>> nexttile
>> plot(abs(received))
>> title("actual received signal")
>> fake_received_signal = conv(best_estimated_chan, modulated);
>> nexttile
>> plot(abs(fake_received_signal))
>> title("fake received signal")
>> copygraphics(gcf)
```

and they don’t look very similar at all :(

Either we made a big mistake, or there’s behavior in the real channel that a convolution with the estimated channel is failing to capture.

We try and plot the difference, praying that there might be something interesting there…and get an error.

```
>> plot(abs(received-fake_received_signal))
Arrays have incompatible sizes for this operation.
```

This leads us to look at the sizes, and more specifically, the difference in sizes:

```
>> size(fake_received_signal)
ans =
225 1
>> size(received)
ans =
218 1
>> 218-225
ans =
-7
```

7 is sus because it’s almost 8, the putative channel size in GSM. We look at the plots again, and we observe that the actual received signal is preceded by samples that look…zero-ish, and exactly 7 of them at that:

```
>> received
received =
-0.0000 + 0.0001i % 1
-0.0014 - 0.0006i % 2
0.0027 + 0.0010i % 3
-0.0003 - 0.0022i % 4
0.0019 + 0.0056i % 5
-0.0040 - 0.0044i % 6
0.0018 + 0.0033i % 7
0.1531 + 0.2801i
0.2313 + 0.2729i
0.3761 - 0.1057i
-0.1923 - 0.1956i
-0.7524 - 0.6256i
```

This hasn’t had AWGN added – this is just from the effect of `stdchan("gsmEQx6", nominal_sample_rate, 0)`. It looks like this Matlab channel simulation is flushing the channel with some noise, since the modulated signal doesn’t look like this at all:

```
>> modulated
modulated =
1.0000 + 0.0000i
1.0000 + 0.0000i
-1.0000 + 0.0000i
-1.0000 + 0.0000i
-1.0000 + 0.0000i
1.0000 + 0.0000i
-1.0000 + 0.0000i
-1.0000 + 0.0000i
1.0000 + 0.0000i
1.0000 + 0.0000i
```

If we run a cross-correlation between this “fake” received signal (generated by convolving the original modulated signal with the estimated channel) and the actual received signal, we get something remarkably disappointing:

```
>> fake_received_signal = conv(best_estimated_chan, modulated);
>> [c, lagz] = xcorr(received, fake_received_signal);
>> stem(lagz,abs(c))
```

If we zoom in on the center, we see it’s unfortunately nowhere near sharp:

It’s still unclear exactly what is going on – the behavior differences at the beginning/end of the channel simulation fail to explain the catastrophic lack of similarity.

We go back to our contrived, integer-valued channel:

`received = conv(modulated, [1,2,3,4,5,4,3,2]);`

and run this again. To make things really simple, we turn off the noise:

`awgned = received;`

We plot `TS_losses` and see a perfect zero loss at an offset of 72 (64 + 8):

but the best estimated channel is nowhere near the actual channel, which is `[1,2,3,4,5,4,3,2]`:

```
>> best_estimated_chan
best_estimated_chan =
1.2873
0.3579
-1.0171
-0.6468
-0.5164
-0.3579
-3.1032
-2.9164
```

So we take a look at our code again, and we find a plain old bug:

```
for offset = 1:(length(received)-clean_part_of_training_sequence)
    interesting_part_of_received_signal = awgned(offset:offset+clean_part_of_training_sequence);
    estimated_chan = lsqminnorm(T, interesting_part_of_received_signal);
    loss_vector = T*estimated_chan - interesting_part_of_received_signal;
    TS_losses(offset) = norm(loss_vector);
end
[val, best_offset] = min(TS_losses);
interesting_part_of_received_signal = awgned(offset:offset+clean_part_of_training_sequence);
best_estimated_chan = lsqminnorm(T, interesting_part_of_received_signal)
```

In the penultimate line, where we slice out the part of the received signal we’ll run a least-squares on, the indices for the slice are `(offset:offset+clean_part_of_training_sequence)`.

`offset` is the loop variable, and its value there is quite simply the offset of the *last* least-squares computed in the loop. It is not `best_offset`, which is the index of the minimum loss.

We fix this, and now the best estimated channel is indeed what we expect it to be:

```
best_estimated_chan =
1.0000
2.0000
3.0000
4.0000
5.0000
4.0000
3.0000
2.0000
```

Out of curiosity, I gave the above excerpt to ChatGPT-4 preceded by the prompt “find the bug:”, and it figured it out perfectly!

GPT-3.5 didn’t clue in on it, and even GPT-4 didn’t clue in on it when fed the whole file’s contents rather than the excerpt.

With this fix in place, we see that the fake received signal (generated by convolving the transmitted signal with the best estimated channel) is indeed identical to the original received signal, at least with the contrived channel:

We go back to the real channel (well, it’s not an actual IRL channel but it’s a simulation of an IRL channel), and see how much better our “fake” (generated by convolving the transmitted signal with the best estimated channel) received signal is:

Now *this* is a specimen!

This is, effectively, **the same signal**, except for subtleties in how the beginning and end are handled. It looks like the channel simulator flushes (seasons?) the beginning of the channel with very low-amplitude noise, and handles the end by “cutting off” the simulation before letting the channel “drain”, whereas the simplistic convolution doesn’t prepend low-amplitude noise at the beginning and lets the channel completely drain out.
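The sample-count bookkeeping backs this up: a full linear convolution of an N-sample signal with an L-tap channel produces N + L − 1 samples, which is exactly the 7-sample surplus we measured earlier. A quick check (Python here, just for illustration):

```python
import numpy as np

N, L = 218, 8                             # burst length and channel length from the post
y = np.convolve(np.ones(N), np.ones(L))   # full linear convolution
print(len(y), len(y) - N)                 # 225 samples: exactly L - 1 = 7 longer
```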

If we add sufficient zeros to the beginning and end of the modulated signal, and feed it into both the channel simulation and the convolution:

```
>> received = interference_channel([zeros(16,1);modulated;zeros(16,1)]);
>> tiledlayout(2,1)
>> nexttile
>> plot(abs(received))
>> title("actual received signal")
>> fake_received_signal = conv(best_estimated_chan, [zeros(16,1);modulated;zeros(16,1)]);
>> nexttile
>> plot(abs(fake_received_signal))
>> title("fake received signal")
>> copygraphics(gcf)
```

Since these are complex signals, we should be sure that the real/imaginary parts match, and not just the magnitudes:

```
>> received = interference_channel([zeros(16,1);modulated;zeros(16,1)]);
>> fake_received_signal = conv(best_estimated_chan, [zeros(16,1);modulated;zeros(16,1)]);
>> tiledlayout(2,2)
>> nexttile
>> plot(real(received))
>> title("received signal, real")
>> nexttile
>> plot(imag(received))
>> title("received signal, imag")
>> nexttile
>> plot(real(fake_received_signal))
>> title("fake received signal, real")
>> nexttile
>> plot(imag(fake_received_signal))
>> title("fake received signal, imag")
>> copygraphics(gcf)
```

Looks good!

We’ve actually made a working time synchronization estimator! Unfortunately it’s incredibly inefficient – requiring a whole least-squares estimate for every possible offset. However, it will serve as an ironclad gold standard^{1} and help us build a more efficient synchronization estimator next time.

“ironclad gold standard” is funny if you take it literally (usually you plate cheap metals with gold, not the other way around :D)↩︎

Let’s look at the case when the actual channel is slightly longer than the model. If the channel is **much** longer than the model, then the model will be unable to capture most of the channel’s properties, and the channel estimation will probably be nonsense. But if the channel is only *slightly* longer than the model, then the model will be able to capture most of the channel’s information (with some inherent inaccuracy). In this case, the channel estimation might be good enough for practical purposes.

For a practical example, let’s say that we have a training sequence of length 10, and an assumed channel of length 4. As usual, the training sequence is sandwiched between unknown data symbols.

We call the training sequence symbols \(TS = t_{1}, \cdots, t_{10}\), the data symbols (presumed unknown here) \(d_{i}\) (for \(i<1\) or \(i>10\)), and the channel coefficients \(chan = c_1,\cdots,c_4\). The channel is assumed to be causal, so the channel output is given by convolving the channel coefficients with the transmitted symbols \(d_{-1}, d_{0}, t_{1}, \cdots, t_{10}, d_{11}, d_{12}\).

The observation model is a convolution of the signal against the channel and looks like this (we draw the matrices aligned that way to make it clear how the multiplication works): \[\begin{align*} & \begin{bmatrix} c_{1}\\ c_{2}\\ c_{3}\\ c_{4} \end{bmatrix} & \\ \begin{bmatrix} t_{1} & d_{0} & d_{-1} & d_{-2} \\ t_{2} & d_{1} & d_{0} & d_{-1} \\ t_{3} & t_{2} & t_{1} & d_{0} \\ \hdashline t_{4} & t_{3} & t_{2} & t_{1} \\ t_{5} & t_{4} & t_{3} & t_{2} \\ t_{6} & t_{5} & t_{4} & t_{3} \\ t_{7} & t_{6} & t_{5} & t_{4} \\ t_{8} & t_{7} & t_{6} & t_{5} \\ t_{9} & t_{8} & t_{7} & t_{6} \\ t_{10} & t_{9} & t_{8} & t_{7} \\ \hdashline d_{11} & t_{10} & t_{9} & t_{8} \\ d_{12} & d_{11} & t_{10} & t_{9} \\ d_{13} & d_{12} & d_{11} & t_{10} \\ \end{bmatrix} \ast&& = \begin{bmatrix} {r_{2}} \\ {r_{3}} \\ {r_{4}} \\ \hdashline {r_{5}} \\ {r_{6}} \\ {r_{7}} \\ {r_{8}} \\ {r_{9}} \\ {r_{10}} \\ {r_{11}} \\ \hdashline {r_{12}} \\ {r_{13}} \\ {r_{14}} \\ \end{bmatrix} \end{align*}\]

Everything above and below the dashed lines is influenced by unknown data symbols, so we focus on the middle section of the matrix and the corresponding produced received symbols.

We notice that the least squares process “tries to” estimate the received symbols as a linear combination of the four columns of this matrix:

\[\begin{bmatrix} t_{4} & t_{3} & t_{2} & t_{1} \\ t_{5} & t_{4} & t_{3} & t_{2} \\ t_{6} & t_{5} & t_{4} & t_{3} \\ t_{7} & t_{6} & t_{5} & t_{4} \\ t_{8} & t_{7} & t_{6} & t_{5} \\ t_{9} & t_{8} & t_{7} & t_{6} \\ t_{10} & t_{9} & t_{8} & t_{7} \\ \end{bmatrix}\]

The least squares process will try to find coefficients \([\hat{c}_{1}, \hat{c}_{2}, \hat{c}_{3}, \hat{c}_{4}]\) such that the received symbols are well approximated by the linear combination of the four columns. The first received symbol not affected by unknown data – the first received symbol we ingest for the least-squares – is \(r_5 = c_1 \cdot t_4 + c_2 \cdot t_3 + c_3 \cdot t_2 + c_4 \cdot t_1\), and this makes sense since the first row of the observation matrix is \([t_4, t_3, t_2, t_1]\).
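Here’s a tiny NumPy sketch of this observation model, just to check the index bookkeeping. The training symbols, surrounding data, and channel values are all made up:

```python
import numpy as np

t = np.array([1, -1, 1, 1, -1, -1, -1, 1, -1, 1], float)   # t1..t10 (made up)
d = np.array([1, 1, -1, 1, -1, -1, 1, 1], float)           # unknown data around the TS
c = np.array([0.9, -0.4, 0.2, 0.1])                        # channel c1..c4 (made up)

tx = np.concatenate([d[:4], t, d[4:]])     # data, then t1..t10, then data
r = np.convolve(tx, c)                     # causal channel, full convolution

# the middle block: first row [t4 t3 t2 t1], last row [t10 t9 t8 t7]
T = np.lib.stride_tricks.sliding_window_view(t, 4)[:, ::-1]   # 7 rows x 4 columns

# r5..r11 in the post's numbering: t4 sits at tx[7], so that's r[7:14] here
est, *_ = np.linalg.lstsq(T, r[7:14], rcond=None)
print(np.round(est, 6))                    # recovers c exactly (noiseless, full rank)
```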

If we add an extra channel coefficient \(c_5\) , then the observation model looks like this. We bold the \(\mathbf{d_{0}}\) in the fifth column since it is between the dashed lines but is an unknown data symbol. The red terms are the terms added by the extra channel coefficient.

\[\begin{align*} & \begin{bmatrix} c_{1}\\ c_{2}\\ c_{3}\\ c_{4} \\ \color{red}c_{5} \end{bmatrix} & \\ \begin{bmatrix} t_{1} & d_{0} & d_{-1} & d_{-2} & \color{red} d_{-3} \\ t_{2} & d_{1} & d_{0} & d_{-1} & \color{red} d_{-2} \\ t_{3} & t_{2} & t_{1} & d_{0} & \color{red} d_{-1}\\ \hdashline t_{4} & t_{3} & t_{2} & t_{1} & \color{red} \mathbf{d_{0}}\\ t_{5} & t_{4} & t_{3} & t_{2} & \color{red} t_{1} \\ t_{6} & t_{5} & t_{4} & t_{3} & \color{red} t_{2}\\ t_{7} & t_{6} & t_{5} & t_{4} & \color{red} t_{3}\\ t_{8} & t_{7} & t_{6} & t_{5} & \color{red}t_{4}\\ t_{9} & t_{8} & t_{7} & t_{6} & \color{red}t_{5}\\ t_{10} & t_{9} & t_{8} & t_{7} & \color{red}t_{6}\\ \hdashline d_{11} & t_{10} & t_{9} & t_{8} & \color{red}t_{7}\\ d_{12} & d_{11} & t_{10} & t_{9} & \color{red}t_{8}\\ d_{13} & d_{12} & d_{11} & t_{10} & \color{red}t_{9}\\ \end{bmatrix} \ast&& = \begin{bmatrix} {r_{2}} \\ {r_{3}} \\ {r_{4}} \\ \hdashline {r_{5}} \\ {r_{6}} \\ {r_{7}} \\ {r_{8}} \\ {r_{9}} \\ {r_{10}} \\ {r_{11}} \\ \hdashline {r_{12}} \\ {r_{13}} \\ {r_{14}} \\ \end{bmatrix} + c_{5} \begin{bmatrix} \color{red}d_{-3} \\ \color{red}d_{-2} \\ \color{red}d_{-1}\\ \hdashline \color{red}\mathbf{d_{0}}\\ \color{red}t_{1} \\ \color{red}t_{2}\\ \color{red}t_{3}\\ \color{red}t_{4}\\ \color{red}t_{5}\\ \color{red}t_{6}\\ \hdashline \color{red}t_{7}\\ \color{red}t_{8}\\ \color{red}t_{9} \end{bmatrix} \end{align*}\]

If \(c_5\) is significantly less than the other coefficients, we can treat its contribution (the new vector on the right) to the received symbols like noise (and hope that it won’t mess up the estimation too much).

If the energy is distributed pretty evenly across all five channel coefficients, we can’t do this, since it’ll significantly worsen the fit. However, if we have the opposite case – \(c_1\) significantly less than the other coefficients – we can do something similar to the previous case. We can treat the *first* column of the observation matrix as noise, and try to estimate the other four columns.

Here’s the observation model if we’re trying to treat \(c_1\) as negligible:

\[\begin{align*} & \begin{bmatrix} \color{red}c_{1}\\ c_{2}\\ c_{3}\\ c_{4} \\ c_{5} \end{bmatrix} & \\ \left[ \begin{array}{c|cccc} \color{red}t_{1} & d_{0} & d_{-1} & d_{-2} & d_{-3} \\ \color{red}t_{2} & d_{1} & d_{0} & d_{-1} & d_{-2} \\ \color{red}t_{3} & t_{2} & t_{1} & d_{0} & d_{-1}\\ \hdashline \color{red}t_{4} & t_{3} & t_{2} & t_{1} & \mathbf{d_{0}}\\ \color{red}t_{5} & t_{4} & t_{3} & t_{2} & t_{1} \\ \color{red}t_{6} & t_{5} & t_{4} & t_{3} & t_{2}\\ \color{red}t_{7} & t_{6} & t_{5} & t_{4} & t_{3}\\ \color{red}t_{8} & t_{7} & t_{6} & t_{5} & t_{4}\\ \color{red}t_{9} & t_{8} & t_{7} & t_{6} & t_{5}\\ \color{red}t_{10} & t_{9} & t_{8} & t_{7} & t_{6}\\ \hdashline \color{red}d_{11} & t_{10} & t_{9} & t_{8} & t_{7}\\ \color{red}d_{12} & d_{11} & t_{10} & t_{9} & t_{8}\\ \color{red}d_{13} & d_{12} & d_{11} & t_{10} & t_{9}\\ \end{array}\right] \ast&& = c_{1} \begin{bmatrix} \color{red}t_{1} \\ \color{red}t_{2} \\ \color{red}t_{3} \\ \hdashline \color{red}t_{4} \\ \color{red}t_{5} \\ \color{red}t_{6} \\ \color{red}t_{7} \\ \color{red}t_{8} \\ \color{red}t_{9} \\ \color{red}t_{10} \\ \hdashline \color{red}d_{11} \\ \color{red}d_{12} \\ \color{red}d_{13} \\ \end{bmatrix} + \begin{bmatrix} {r_{2}} \\ {r_{3}} \\ {r_{4}} \\ \hdashline {r_{5}} \\ {r_{6}} \\ {r_{7}} \\ {r_{8}} \\ {r_{9}} \\ {r_{10}} \\ {r_{11}} \\ \hdashline {r_{12}} \\ {r_{13}} \\ {r_{14}} \\ \end{bmatrix} \end{align*}\]

However, we need to be careful with indices when we do this! Note that if we ignore the stuff in red, the submatrix in between the horizontal dashed lines is **not** the same column vectors as previously! This will cause trouble! In order to get the **same column vectors** (those only containing known training symbols and spanning the maximum possible length) as before *and* have them multiplied with the correct subset of the channel \([c_2,\cdots,c_5]\), we need to shift the indices by one.

Indeed, to have a least-squares that generates a sensible output, the first received symbol we use needs to be \(\textrm{noise} + c_2 \cdot t_4 + c_3 \cdot t_3 + c_4 \cdot t_2 + c_5 \cdot t_1\) (with the “noise” being the contribution from the first channel tap, which we are assuming is negligible). We need this since the first row of the least-squares matrix *still* is \([t_4, t_3, t_2, t_1]\). Looking at the indices of this desired received symbol, we see that this would be \(r_6\), not \(r_5\) in the previous case.

To write the correct observation model (which highlights the matrix we’ll use for least-squares), we simply move the horizontal dashed lines one row down!

\[\begin{align*} & \begin{bmatrix} \color{red}c_{1}\\ c_{2}\\ c_{3}\\ c_{4} \\ c_{5} \end{bmatrix} & \\ \left[ \begin{array}{c|cccc} \color{red}t_{1} & d_{0} & d_{-1} & d_{-2} & d_{-3} \\ \color{red}t_{2} & d_{1} & d_{0} & d_{-1} & d_{-2} \\ \color{red}t_{3} & t_{2} & t_{1} & d_{0} & d_{-1}\\ \color{red}t_{4} & t_{3} & t_{2} & t_{1} & \mathbf{d_{0}}\\ \hdashline \color{red}t_{5} & t_{4} & t_{3} & t_{2} & t_{1} \\ \color{red}t_{6} & t_{5} & t_{4} & t_{3} & t_{2}\\ \color{red}t_{7} & t_{6} & t_{5} & t_{4} & t_{3}\\ \color{red}t_{8} & t_{7} & t_{6} & t_{5} & t_{4}\\ \color{red}t_{9} & t_{8} & t_{7} & t_{6} & t_{5}\\ \color{red}t_{10} & t_{9} & t_{8} & t_{7} & t_{6}\\ \color{red}d_{11} & t_{10} & t_{9} & t_{8} & t_{7}\\ \hdashline \color{red}d_{12} & d_{11} & t_{10} & t_{9} & t_{8}\\ \color{red}d_{13} & d_{12} & d_{11} & t_{10} & t_{9}\\ \end{array}\right] \ast&& = c_{1} \begin{bmatrix} \color{red}t_{1} \\ \color{red}t_{2} \\ \color{red}t_{3} \\ \color{red}t_{4} \\ \hdashline \color{red}t_{5} \\ \color{red}t_{6} \\ \color{red}t_{7} \\ \color{red}t_{8} \\ \color{red}t_{9} \\ \color{red}t_{10} \\ \color{red}d_{11} \\ \hdashline \color{red}d_{12} \\ \color{red}d_{13} \\ \end{bmatrix} + \begin{bmatrix} {r_{2}} \\ {r_{3}} \\ {r_{4}} \\ {r_{5}} \\ \hdashline {r_{6}} \\ {r_{7}} \\ {r_{8}} \\ {r_{9}} \\ {r_{10}} \\ {r_{11}} \\ {r_{12}} \\ \hdashline {r_{13}} \\ {r_{14}} \\ \end{bmatrix} \end{align*}\]

This makes sense. If most of the energy was in the later channel taps and the real channel is bigger than the model, we’ll indeed want to use a slightly later slice of the received symbols!
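We can check this index shift numerically. With a made-up 5-tap channel whose first tap is tiny, running the 4-tap least squares on the slice shifted one sample later recovers \(c_2,\ldots,c_5\) to within roughly the size of the neglected tap, while the unshifted slice is aligned with the wrong taps (NumPy sketch, toy values):

```python
import numpy as np

t = np.array([1, -1, 1, 1, -1, -1, -1, 1, -1, 1], float)   # t1..t10 (made up)
d = np.array([1, -1, -1, 1, 1, 1, -1, 1, -1, -1], float)   # unknown data around the TS
c = np.array([1e-3, 0.8, -0.5, 0.3, 0.2])                  # c1 tiny, c2..c5 dominant

tx = np.concatenate([d[:5], t, d[5:]])
r = np.convolve(tx, c)                     # true channel is 5 taps, model is 4

# the same 4-tap model matrix as before: rows [t4 t3 t2 t1] .. [t10 t9 t8 t7]
T = np.lib.stride_tricks.sliding_window_view(t, 4)[:, ::-1]

# slice aligned as if the channel started at c1: estimates ~[c1 c2 c3 c4]
est_naive, *_ = np.linalg.lstsq(T, r[8:15], rcond=None)
# slice shifted one sample later, aligned with c2..c5: estimates ~[c2 c3 c4 c5]
est_shift, *_ = np.linalg.lstsq(T, r[9:16], rcond=None)
print(np.round(est_shift, 3))
```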

If we have a slightly larger channel than the model, we can use the energy distribution – which we can find out via a simple correlation, if our training sequences are reasonable (their autocorrelation “sharp”: mostly zero except at zero delay^{1}) – to figure out where on the signal to run the least-squares estimation. The amount of error presumably depends on the characteristics of the channel itself. Also, there are more advanced methods for channel estimation, most notably MMSE, which requires knowledge (or estimation) of noise and channel statistics. I can understand how one could estimate the noise statistics (if we have strong enough symbols we can wipe off the noise), but it’s slightly unclear to me how one estimates the channel statistics….if one’s trying to…estimate the channel? If you want to explain how this gets done in real-world systems, I would be delighted to hear about it!

I think what I’ll do next is try to formalize and write code for the time/frequency offset estimation, and get that correctly cueing the channel estimation on the right part of the signal. The goal of this series is to do a survey of *all* the necessary “minimally viable” signal processing elements that compose a reasonable (similar data rates, similar channel properties, similar performance) GSM-ish receiver, not to explore *all* the possible methods (there are many of them, and people keep coming up with more!) for each signal processing block.

We send \(TS\), and receive \(TS \ast chan\) (\(TS\) convolved with the channel impulse response). If we want to estimate the channel with a simple correlation, the receiver computes \((TS \ast chan) \star TS\), where \(\star\) is the *correlation* operator. The properties of convolution and correlation let us rewrite that as \(chan \ast (TS \star TS)\) – the channel impulse response itself, convolved with the autocorrelation of the training sequence. The closer the training sequence autocorrelation is to zero (besides at zero delay), the more accurate the simple correlation method’s estimate of the impulse response.↩︎
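The identity in that footnote is easy to check numerically (NumPy sketch, with a made-up training sequence and channel):

```python
import numpy as np

ts = np.array([1, -1, 1, 1, -1, -1, -1, 1, -1, 1], float)  # toy training sequence
chan = np.array([0.9, 0.4, -0.2])                          # toy channel

rx = np.convolve(ts, chan)                                  # TS * chan
lhs = np.correlate(rx, ts, mode="full")                     # (TS * chan) correlated with TS
rhs = np.convolve(chan, np.correlate(ts, ts, mode="full"))  # chan * (TS autocorrelation)
assert np.allclose(lhs, rhs)    # same thing, by associativity of convolution
```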

we use the code from these two posts to enable mathjax and syntax highlighting with hakyll (errors, omissions, and indentation butchery below are my own, I am not (yet?) a professional haskell programmer)

```
syntaxHighlightingStyle :: Style
syntaxHighlightingStyle = haddock

mathExtensions =
  [ Ext_tex_math_dollars
  , Ext_tex_math_double_backslash
  , Ext_latex_macros
  ]

codeExtensions =
  [ Ext_fenced_code_blocks
  , Ext_backtick_code_blocks
  , Ext_fenced_code_attributes
  ]

defaultExtensions = writerExtensions defaultHakyllWriterOptions
newExtensions = foldr enableExtension defaultExtensions (mathExtensions <> codeExtensions)

pandocWriterSoupedUpOptions = defaultHakyllWriterOptions
  { writerHTMLMathMethod = MathJax ""
  , writerExtensions = newExtensions
  }

-- [...]

create ["css/syntax.css"] $ do
  route idRoute
  compile $ makeItem $ styleToCss syntaxHighlightingStyle
```

and without further ado, we have

\[\begin{eqnarray} x+1 = 2 \\ y+2 = 3 \end{eqnarray}\]

\[\begin{bmatrix} 1 & 2 & 3\\ a & b & c \end{bmatrix}\]

the code for this was inspired from this stackexchange post:

\[\begin{align*} & \begin{bmatrix} m_{0} & m_{1} & m_{2} & m_{3} \\ m_{4} & m_{5} & m_{6} & m_{7} \\ m_{8} & m_{9} & m_{10} & m_{11} \\ m_{12} & m_{13} & m_{14} & m_{15} \end{bmatrix} \\ \begin{bmatrix} v_{0} & v_{1} & v_{2} & v_{3} \end{bmatrix} & \mspace{5mu} \bigl[\begin{matrix} {r_{0}} & \mspace{15mu} {r_{1}} & \mspace{15mu} {r_{2}} & \mspace{15mu} {r_{3}} \end{matrix} \mspace{15mu} \bigr] \end{align*}\]

Here’s some interesting papers and webpages that I have hanging around in open browser tabs. Better to have them here than languishing in browser tabs/history/bookmarks!

- two sets of lecture notes on spatial/array processing that look at different criteria (zero forcing vs error minimization) and deterministic vs stochastic (which I think is a synonym for Bayesian) approaches
- “Beamforming: a versatile approach to spatial filtering” by B.D. Van Veen; K.M. Buckley
- The entire set of notes from the NATO “Advanced Radar Systems, Signal and Data Processing” (RTO-EN-SET-086bis) lecture series

“Cyclic Wiener filtering: Theory and Model” by Gardner

- normal filters add up multiple copies of the same signal, but *time*-offset
- array processing adds up multiple copies of the same signal, but *space*-offset
- FRESH (FREquency SHift) filters add up multiple copies of the same signal, but *frequency*-offset
- this is useful because many signals (like communication/radar RF signals) have redundancy/correlation in their frequency domain (a property called cyclostationarity)

“Noncircularity exploitation in Signal Processing Overview and Application to Radar” by F. Barbaresco, Pascal Chevalier; about widely linear processing/filtering/estimation

- a lot of the time it’s justified to assume that complex-valued signals through complex-valued systems behave the same as real-valued signals and systems (and to use the same sort of filters / estimators you’d use for real-valued everythings)
- pretending that complex signals work just like real signals depends on an assumption called “second-order circularity”
- second-order circularity doesn’t always hold!
- for instance, if the signal (prior to passing through the channel) only takes real values (like -1 or 1, as with BPSK), then there’s a *fundamental asymmetry* between the in-phase and quadrature channels, and that violates the second-order circularity assumption
- note: a symmetric QAM signal (modulated with random data, as always) is itself not circularly symmetric (add a phase offset and the little square lattice gets tilted) but it *is* second-order circular
- if second-order circularity doesn’t hold and you process the received signal in a way that can’t tease apart the asymmetry, then you are leaving signal on the table
- in the case where the modulated signal is only real-valued (or can be transformed to be only real-valued), that special signal structure morally lets you get a sort of *processing gain*, because you know that any variation in the complex axis is noise/interference/etc:
- a linear filter looks like \(y = h\cdot x\) (\(y\) output, \(h\) coefficients, \(x\) input); the *widely-linear* model looks like \(y = g \cdot x + h \cdot x^*\) (\(y\) output, \(h\) and \(g\) coefficients, \(x\) input, and \(x^*\) the complex conjugate of \(x\)) – so it’s linear in both \(x\) and its complex conjugate \(x^*\)
- as I understand it, this lets the system do stuff like “take only the real part of the signal” (because the noise all lives in the imaginary axis) but in a principled way

“Widely Linear Estimation with Complex Data”, by Bernard Picinbono, Pascal Chevalier, also about widely linear processing

“Receivers with widely linear processing for frequency-selective channels” by H. Gerstacker; R. Schober; A. Lampe: more about widely linear processing

Widely linear filtering isn’t new: “Conjugate linear filtering” by W. Brown; R. Crane is from 1969!

“Enhanced widely linear filtering to make quasi-rectilinear signals almost equivalent to rectilinear ones for SAIC/MAIC” by Pascal Chevalier, Rémi Chauvat, Jean-Pierre Delmas

- we saw earlier that if a signal (as transmitted) has a special form and only lives in the reals (like BPSK or a PAM), this allows for a form of processing gain at the receiver
- even more interestingly, this allows for *signal separation* / *interference cancellation* (if both the desired and interfering signal are of this form): the receiver can adjust the phase of the received signal until the desired signal lives only on the reals (this is a linear operation), and *trash* the imaginary component of the signal altogether
- the real-world realization is more complex since there are two channels (desired signal channel, interferer signal channel) that need to be taken into account, but this actually works: it’s called “single antenna interference cancellation” (SAIC)
- some papers about SAIC:
  - “Performance bounds for cochannel interference cancellation within the current GSM standard”
  - “A Single Antenna Interference Cancellation Algorithm for Increased GSM Capacity”
  - “Single antenna interference cancellation (SAIC) for GSM networks”
- the titles of those papers imply that this is deployed in GSM networks, which notably use GMSK – definitely not BPSK nor a PAM
- however, it turns out we can use this “single antenna interference cancellation” for certain modulations that aren’t BPSK or a PAM, with an additional step: the infamous “derotation”, which converts MSK into BPSK, and converts GMSK into an almost-BPSK (“almost” because of the second Laurent pulse)
- this paper goes well beyond standard SAIC, looking into both widely-linear filtering *and* FRESH filtering, in order to exploit the spectral structure of the signal of interest

two books I found that might be useful later

- Wideband SDR Platform on a Budget, Update # 2, Observations of Starlink Downlink w/ Software Defined Radio by reddit user christianhahn09: amazing SDR built with devboards for:
- TI ADC16DX370 dual-channel, 370 Msps, 16-bit ADC
- ADI LTC5594 wideband I/Q demodulator
- Xilinx Kintex-7 KC705 FPGA

- a Windfreak SynthHD PRO v2, dual channel RF signal generator

the reduceron reconfigured and re-evaluated (paper and slides) and Graph Reduction Hardware Revisited: a microarchitecture that does graph reduction

some stuff about haskell’s STG-machine and execution model:

- “A Haskell Compiler” by David Terei
- SPJ’s Implementing Lazy Functional Languages on Stock Hardware: The Spineless Tagless G-machine
- “Lazy evaluation illustrated for Haskell divers” by Takenobu Tani: “The STG-machine is the marriage of Lambda calculus and Turing machine”

- “EverParse: Verified Secure Zero-Copy Parsers for Authenticated Message Formats”:
- real-world data formats are rife with protocol-meaningful numbers (indices/offsets/counts/lengths/ranges), and therefore context-sensitive
- trying to parse them with hand-written code often leads to parsing/semantic validation/action code being blended together in unprincipled and insecure ways (“shotgun parsers”)
- using parsers generated from language descriptions would improve the situation; except that most parser generators are meant for context-free grammars (stuff that looks like a programming language, not an IP packet or a PDF file)
- EverParse addresses this task for TLVish (tag length value) formats

- The computational power of Parsing Expression Grammars
- Implementation and Optimization of PEG Parsers for Use on FPGAs
- Research Report: The Parsley Data Format Definition Language
- A Verified Packrat Parser Interpreter for Parsing Expression Grammars

- overleaf: in-browser LaTeX editor/typesetter
- 0xabadidea’s backlog post – which inspired me to do this poast
- ask useful questions
- some math tricks, poasted by Terence Tao
- maintaining momentum
- Why and how to write things on the Internet
- Transformers from scratch
- bird SQL
- not knowing

In this GSMish scenario we don’t actually need pinpoint/“fine” timing/phase accuracy, since a good enough Viterbi demodulator effectively “cleans up” remaining timing/phase offset as long as it’s fed with an accurate enough channel estimate (especially if it’s able to *update* its channel estimate).

In a simplistic scenario, if our channel looks like \([1,1]\), it doesn’t matter if the channel estimator outputs \([1,1,0,0,0,0,0,0]\) or \([0,1,1,0,0,0,0,0]\) (here we are using the classic GSM design choice of making our channel estimator handle channels of length 8) or anything up to \([0,0,0,0,0,0,1,1]\) – we get the same results at the end. If we’re misaligned enough to get \([0,0,0,0,0,0,0,1]\) we *are* leaving half the energy in the received signal on the table, so we do want as much energy possible in the actual channel’s impulse response to appear within the channel estimate the demodulator is given.
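The shift-equivalence is easy to see numerically: moving the channel taps over by one slot just delays the predicted signal by one sample, so the demodulator sees the same waveform (NumPy sketch, toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 2.0 * rng.integers(0, 2, 32) - 1.0           # stand-in modulated burst

h1 = np.array([1, 1, 0, 0, 0, 0, 0, 0], float)   # channel estimate [1,1,0,...]
h2 = np.array([0, 1, 1, 0, 0, 0, 0, 0], float)   # same taps, shifted one slot

y1 = np.convolve(x, h1)
y2 = np.convolve(x, h2)
# shifting the estimate just delays the predicted signal by one sample
assert np.allclose(y2[1:], y1[:-1]) and y2[0] == 0
```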

Of course, with a more realistic case, the actual channel won’t be just two symbols long, this *is* terrestrial radio, not a PCB trace / transmission line nor an airplane-to-satellite radio channel :p

In the case where the physical channel has a length commensurate with the channel length designed in the channel estimator / demodulator, we want to make sure that our least-squares channel estimator gets aimed at the right place in the burst – if it ingests lots of signal affected by unknown data (as opposed to known training sequence data affected by an unknown channel), its output will be kinda garbage.

We’d be at an impasse^{1} if the least squares estimator was our only tool here, but we have a simpler tool that’s more forgiving of misalignments: cross-correlating the received signal against the modulated training sequence. Another way of thinking of this is that we’re running our received signal through a matched filter (with the reference/template signal the modulated training sequence) – it’s literally the same convolution.
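“Literally the same convolution” can be checked directly: a matched filter is a convolution with the conjugated, time-reversed template, which is exactly what cross-correlation computes (NumPy sketch with stand-in signals, not the post’s data):

```python
import numpy as np

rng = np.random.default_rng(0)
rx = rng.standard_normal(64) + 1j * rng.standard_normal(64)  # stand-in received signal
ts = rng.standard_normal(8) + 1j * rng.standard_normal(8)    # stand-in modulated TS

# matched filter: convolve with the conjugated, time-reversed template...
matched = np.convolve(rx, np.conj(ts[::-1]))
# ...which is the same thing as cross-correlating against the template
xcorr = np.correlate(rx, ts, mode="full")
assert np.allclose(matched, xcorr)
```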

Doing this gives us something that looks like this:

Using the Mk I eyeball, it’s pretty clear where the training sequence lives – at the tallest peak.

For implementation in software or gateware, we can encode this logic pretty easily: calculate the correlation, then iterate and look for the biggest peak. However, we notice that there’s a bunch of spurious peaks all around, and it’d be quite bad if we accidentally matched on a spurious peak: the channel estimate would be garbage, and the output of the demodulator would be beyond useless, since it wouldn’t even be starting off at the right spot in the signal.

We can avoid this failure case by running the correlation on a smaller window, which reduces the chances of hitting a false correlation peak. We determine the position of the smaller window using our prior knowledge of the transmitted signal structure – where the training sequence lives relative to the start of the signal – and an estimator to determine when the start of the signal happens.

It’s pretty easy to determine when the start of the signal happens: square and sum each incoming I/Q pair to get a magnitude, keep a little window of those magnitudes, and when their sum exceeds a threshold, well, that’s when the signal started.
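That energy detector fits in a few lines. The window length and threshold below are arbitrary illustrative values, not tuned for any real system:

```python
from collections import deque

def start_of_signal(samples, window=8, threshold=4.0):
    """Index where the windowed energy first exceeds the threshold,
    or None if it never does."""
    win = deque(maxlen=window)
    energy = 0.0
    for n, (i, q) in enumerate(samples):
        mag2 = i * i + q * q               # square/sum each I/Q pair
        if len(win) == win.maxlen:
            energy -= win[0]               # oldest magnitude falls out
        win.append(mag2)
        energy += mag2
        if energy > threshold:
            return n
    return None

# A quiet noise floor, then a burst of unit-magnitude samples at index 100.
samples = [(0.01, 0.01)] * 100 + [(1.0, 0.0)] * 50
print(start_of_signal(samples))            # fires a few samples after 100
```

Note the detector fires a few samples late (the window needs to accumulate enough energy first), which is part of why this estimator has more variance than the correlation-based one.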

We use this to narrow down the possible locations for the training sequence in the received signal. However, we still should run the correlation since this energy-detection start-of-signal estimator has more variance than the correlation-based timing offset estimator.

Incidentally, the GSM training sequences (and lots of training sequences in other well-designed wireless communications systems) have interesting properties:

- their power spectra are approximately flat
- their autocorrelations have a tall, narrow peak that approximates an impulse, with much less energy elsewhere

The former is a desired property since we want to evenly probe the frequency response of the bandpass channel. Spreading the training sequence’s power unevenly (lots of power in one part of the passband and much less in another part of the passband) causes a worse signal-to-noise^{2} ratio in the parts of the passband with less training sequence power. It’s a zero-sum affair since the transmitter has finite transmit power.

The autocorrelation property not only lets us use these training sequences for time synchronization, but it lets us use correlation as a rough channel impulse response estimate. If we’re satisfied with a very suboptimal receiver, we can just use the correlation as our channel estimate. However, least-squares generally will give us a more accurate channel impulse response, since the autocorrelation of the training sequence is not 1 at zero lag and 0 elsewhere – there’s little sidelobes:
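To make the “correlation is only a rough channel estimate” point concrete, here’s a small numerical sketch. The ±1 sequence and the channel taps are made up for illustration:

```python
import random

random.seed(1)

# A made-up +/-1 training sequence: its autocorrelation only
# *approximates* an impulse, so the estimate below inherits the
# sidelobe error.
ts = [random.choice([-1.0, 1.0]) for _ in range(64)]
channel = [1.0, 0.6, 0.3]          # the channel we try to estimate

# Noise-free received signal: training sequence convolved with channel.
rx = [0.0] * (len(ts) + len(channel) - 1)
for n, t in enumerate(ts):
    for k, h in enumerate(channel):
        rx[n + k] += t * h

# Rough estimate: correlate rx against ts, normalized by the
# training sequence's energy.
energy = sum(c * c for c in ts)
h_est = [sum(rx[lag + i] * ts[i] for i in range(len(ts))) / energy
         for lag in range(len(channel))]
print(h_est)                       # roughly [1.0, 0.6, 0.3]
```

The residual error on each tap is exactly the sidelobe leakage from the other taps; least-squares removes that bias, which is why it generally wins.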

If you don’t have a good intuition for what a narrow autocorrelation does here, you can develop one by going to a loading dock or a construction site and paying attention when big trucks or earthmoving equipment back up. See, those big rigs are required to have a back-up beeper to warn bystanders that the driver is backing up and can’t see well what’s behind the vehicle.

There’s two common types of back-up beeper, and unfortunately the more common kind outputs a series of beeps of a single pure tone (without changing frequency between beeps). If you close your eyes and only use that sound to determine where the truck is, you’ll find it’s quite a difficult task: it seems like the sound is coming from *everywhere*! The brain has a variety of mechanisms to localize sources of sound, and besides the ultra-basic “find which ear is receiving the loudest signal” method many of them kinda boil down to doing cross-correlations of variously-delayed versions of the left ear’s signal against variously-delayed versions of the right ear’s signal, and looking for correlation peaks. Seems familiar!

Unfortunately, the pure sine tone is the *worst* possible signal for this, since there’ll be tons of correlation peaks (each oscillation of the sine wave is identical to its precursor and successor), and if there’s audio-reflective surfaces around you and the truck, there’ll be tons of echoes too. Ambiguities galore! More spurs than a cowboy convention!

Ironically, **the most useful** (for angle-of-arrival localization) **part of the pure-tone truck beeper’s signal is the moment the beep starts**^{3}, since the precursor is *zero* – the rest of the beep is comparatively useless for localization (an estimation task) but extremely useful for knowing that there’s indeed a truck **somewhere in the neighborhood** backing up (a detection task). The start and end of the beep are the most spectrally rich part of the beeper’s output, and this is indeed what we expect.

The pure sine wave is the *easiest* possible signal to detect (with our friend the matched filter), but the *worst* possible signal for localization; and this irony is why you can hear truck back-up beepers from *uselessly* far away but can’t easily tell which truck is backing up.

Fortunately, there’s truck back-up beepers that output sounds far more amenable to localization: little bursts of white noise. If you haven’t heard those, you can find a youtube video of those in action, play it on your computer, and try and localize your *computer’s speakers* with your eyes closed.

You’ll notice that this is basically the optimal signal if you want to do angle-of-arrival estimation with delays and correlations – there’s only one correlation peak, and it’s exactly where you want it. It’s also extremely spectrally rich, and it has to be, since spectrally poor signals have worse autocorrelation properties. It also has the advantage of “blending in” with other noise: on-and-off bursts of white noise get “covered up” by white noise (and become indistinguishable from white noise) very quickly; a pure tone is much more difficult to cover up with white noise.

This is what a good training sequence looks like: simple correlation gets you a passable estimate for the channel impulse response along with the timing offset, since the autocorrelation approximates an impulse. Also, the spectral richness ensures that all the frequency response of the bandpass channel is probed.

I don’t think there’s too much useful we can do with the coarse correlation-based channel estimation to enable a more accurate channel estimation with more advanced (least-squares) methods – I had imagined looking at the coarse correlation-based channel estimate and looking for a window with the most energy and then doing a least-squares channel estimate only on that window, but I don’t think that actually has realistic benefits.

However, that idea (focusing on where energy is concentrated in the channel impulse response) *does* point to a more fructuous^{4} game we can play with channel impulse response: transforming the channel to *squash* the channel’s energy as much as possible into the earlier channel coefficients, and this is called “channel shortening”. Channel shortening is interesting because rather than having to delay decisions until the last possible moment, we can commit to decisions earlier, which reduces the computational burden (and area/power requirements) on a Viterbi-style demodulator pretty significantly.

If the channel’s impulse response is highly front-loaded into, say, the first 3 symbols, we can force a decision after only 3 symbol periods, since it’s very unlikely that anything *after that* would make us change our mind. We still keep track of the *effect* of our decisions for as long as the channel lasts, since otherwise we’d be introducing actual error (even if we make all the right decisions) that’s cheap to avoid: once we’ve made the decisions, figuring out their effect is as simple as feeding them through a channel-length FIR filter.

maybe not: I am unsure if looking at the least-squares residuals would be enough to detect a lack of time synchronization↩︎

which I am assuming to be distributed evenly across the passband↩︎

the moment the beep ends is theoretically the same but your ears are more desensed than when the beep *starts*↩︎

I’ve *always* wanted to use that word (or rather, its French cognate “fructueux”) in writing.↩︎

In my post on least-squares channel estimation, I had done some reasoning about which received samples can be safely (they’re not affected by unknown data) used for a least-squares channel estimation:

The simple way to cope with this is to refuse to touch the first \(L-1\) samples, and run our channel impulse response estimate over the \(M-L+1\) samples after those. In GSM, this still gives us good performance, since for \(M=26\), \(L=8\) we have 19 samples to estimate 8 channel coefficients. Note that we also can’t use the trailing (in the scan, the last 4 rows) received symbols, since those *also* are affected by unknown data.

Now, our convolution matrix has dimensions \(M-L+1\) by \(L\), which makes sense: the only “trustworthy” (unaffected by unknown data) stretch of symbols is \(M-L+1\) long, and we are convolving by a channel of length \(L\).

Figuring out the exact offset for `interference_rx_downsampled` has been a bit tricky, and I haven’t yet dived into writing the right correlation to estimate the exact timing offset required.

From playing around some more in MATLAB with my source code, I realized I still don’t have a strong understanding of the exact offsets/indices/lengths at play here.

Rather than stare at algebraic expressions, we will draw pictures that speak to the physical meaning of the problem to help us reach expressions we actually *understand*.

We’ll take a generic GSM-like^{1} transmitted burst that is composed of \(D_1\) data bits, followed by a *midamble* of \(TS\) training symbol bits, and \(D_2\) data bits.

Here’s what the burst looks like. I’ve written down the indices (starting at 1) for the first and last bit in each section.

We note that all the lengths are correct:

- First data section is from \(1\) to \(D_1\) so its length is \(D_1-1+1 = D_1\)
- Midamble is from \(D_1+1\) to \(D_1 + TS\) so its length is \(D_1+TS-(D_1+1)+1 = TS-1+1 = TS\)
- Second data section is from \(D_1+TS+1\) to \(D_1+TS+D_2\) so its length is \(D_1 + TS + D_2 - (D_1 + TS + 1) + 1 = D_1-D_1 + TS - TS + D_2 - 1 + 1 = D_2\).
- Total burst is from \(1\) to \(D_1+TS+D_2\) so its length is \(D_1 + TS + D_2 - 1 + 1 = D_1 + TS + D_2\).
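We can let the computer double-check that index arithmetic with concrete, GSM-flavored numbers (58 data bits per side and a 26-bit midamble; tail/guard bits ignored, and these are example values, not a spec reference):

```python
# GSM-flavored example numbers (tail/guard bits ignored).
D1, TS, D2 = 58, 26, 58

# 1-indexed (first, last) bounds for each section, as in the diagram.
sections = {
    "first data":  (1, D1),
    "midamble":    (D1 + 1, D1 + TS),
    "second data": (D1 + TS + 1, D1 + TS + D2),
}

for name, (first, last) in sections.items():
    print(f"{name}: [{first}, {last}], length {last - first + 1}")

assert sections["first data"][1] - sections["first data"][0] + 1 == D1
assert sections["midamble"][1] - sections["midamble"][0] + 1 == TS
assert sections["second data"][1] - sections["second data"][0] + 1 == D2
assert sections["second data"][1] == D1 + TS + D2   # total burst length
```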

It’s clear how we can isolate any particular section of this burst before it has passed through a dispersive channel.

A quick refresher on interval notation:

- \([1,5]\) means “1 to 5, inclusive of the bounds (‘closed’) on both sides”, and represents \(\{1,2,3,4,5\}\)
- \((1,5)\) means “1 to 5, non-inclusive of the bounds (‘open’) on both sides”, and represents \(\{2,3,4\}\)
- We also can have left-closed right-open: \([1,5)\) is inclusive of the \(1\) but not of the \(5\), so we have \(\{1,2,3,4\}\)
- And likewise with left-open right-closed: \((1,5]\) represents \(\{2,3,4,5\}\)

As the subtitle in the header insinuates, a dispersive channel is represented by a convolution. The structure of convolution tells us that each transmitted sample will affect multiple received samples, and the channel vector’s finite length tells us it’s not gonna be all of them.

We note that a single sample will be “smeared out” by a channel of length \(L\) onto a span that’s \(L\) long:

As for the indices, if this sample lives at index \(n\), the index of this little “span of influence” will be \([n, n+L-1]\). Why these indices?

- the starting index: We currently don’t care^{3} about *absolute delays*, just what happens *inside the delay spread*. Remember the “ideal coaxial cable” thought experiment from our last post: the problem remains identical no matter how much ideal coaxial cable lives between our receiver antenna and our receiver frontend. We can therefore say that the input sample at index \(n\) gets transmogrified by an “identity channel” (impulse response of \([1]\), it doesn’t change the signal at all) to be an output sample at index \(n\) – no need to add any offset. This means that the **first** output sample to be affected by our input sample will be at index \(n\), which justifies the left-closed (includes its boundary): \([n,\)
- the ending index: If the “span of influence” is \(L\) long, the last sample that is affected by our input sample will be at index \(n+L-1\). This justifies the right-closed (includes the boundary): \(,n+L-1]\)

Going back to our “single element convolution” example, if the \(x\) input sample lives at index \(10\), the first nonzero output sample lives at index \(10\) by fiat. We observe nonzero output samples at \(11, 12, 13, 14\) as well. Output sample \(15\) and beyond are zero, as are samples \(9\) and lower. This means that we have nonzero output at \([10, 14]\), and if we let \(n=10\) and \(L=5\) we get \([10, 10+5-1]=[10,14]\), which matches up with what we see.
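The same example in a few lines of Python (0-indexed, which conveniently matches the indices in the prose):

```python
# Direct convolution: each input sample smears onto an L-long span.
def convolve(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

L = 5
x = [0.0] * 20
x[10] = 1.0                        # a single input sample at index n = 10

y = convolve(x, [1.0] * L)         # all-ones channel of length L
nonzero = [n for n, v in enumerate(y) if v != 0.0]
print(nonzero)                     # [10, 11, 12, 13, 14] = [n, n+L-1]
```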

There is a definite structure to the transmitted burst: known data (the training sequence) sandwiched by unknown data. In realistic systems, the designers will select a training sequence length longer than any reasonable channel they expect to contend with, and so we expect:

- some received samples will be a function only of unknown data
- some received samples will be a function of unknown data and training sequence bits
- some received samples will be a function only of training sequence bits

To figure out which received samples are which, let’s draw out what happens when our burst gets convolved with a channel of length \(L\). Each transmitted symbol will get “smeared out” onto an \(L\)-long span, and we focus on the symbols at the boundaries of each section.

The center line represents what the receiver hears, and for clarity, we draw the unknown data sections above the center line and the training sequence below the center line.

Things are much more clear now!

- \([1, D_1]\), with length \((D_1)-(1)+1=D_1\): the output is only affected by the first data section
- \([D_1+1, D_1+L-1]\), with length \((D_1+L-1)-(D_1+1)+1=L-1\): the output is affected by the first data section *and* the training sequence
- \([D_1+L, D_1+TS]\), with length \((D_1+TS)-(D_1+L)+1=D_1+TS-D_1-L+1=TS-L+1\): the output is only affected by the training sequence. **This is the section we use for a least-squares channel estimate!**
- \([D_1+TS+1, D_1+TS+L-1]\), with length \((D_1+TS+L-1) - (D_1+TS+1) + 1= D_1 + TS +L -1 -D_1 -TS -1 +1= L-1\): the output is affected by the training sequence *and* the second data section
- \([D_1+TS+L, D_1+TS+D_2+L-1]\), with length \((D_1+TS+D_2+L-1) - (D_1+TS+L) + 1= D_1 +TS + D_2+L - 1 -D_1 -TS -L +1 = D_2\): the output is only affected by the second data section

Now let’s sum^{4} up all those lengths to see if our work checks out: \((D_1) + (L-1) + (TS-L+1) + (L-1) + (D_2) = D_1 + L -1 +TS -L +1 +L -1 +D_2 = D_1 +D_2 +TS +L -1\). This is indeed what we get when we convolve a vector with length \(D_1+D_2+TS\) (the total length of the burst as it’s transmitted) by a vector with length \(L\) (the channel)!
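We can also verify the whole breakdown mechanically by convolving “symbolically”: instead of numbers, we track which sections influence each received sample. The \(D_1\), \(TS\), \(D_2\), \(L\) values below are GSM-flavored examples:

```python
D1, TS, D2, L = 58, 26, 58, 8      # GSM-flavored example values

burst = ["data1"] * D1 + ["ts"] * TS + ["data2"] * D2
influences = [set() for _ in range(len(burst) + L - 1)]
for n, section in enumerate(burst):
    for k in range(L):             # sample n smears onto [n, n+L-1]
        influences[n + k].add(section)

# Received samples influenced *only* by the training sequence (0-indexed):
ts_only = [n for n, s in enumerate(influences) if s == {"ts"}]
print(len(ts_only))                # TS - L + 1 = 19 usable samples

assert len(ts_only) == TS - L + 1
# 0-indexed bounds [D1+L-1, D1+TS-1] match the 1-indexed [D1+L, D1+TS]
assert ts_only[0] == D1 + L - 1
assert ts_only[-1] == D1 + TS - 1
assert len(influences) == D1 + TS + D2 + L - 1   # total output length
```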

As usual, if you notice an error in my work, I’d be very grateful if you could point it out to me.

GSM’s “stealing bits” act like regular bits for modulation/demodulation, and the tail bit structure is not relevant for channel estimation (it will be relevant when we look at trellises).↩︎

or rather, convolving in↩︎

We will soon need to care about absolute delays to solve the *time synchronization* problem. Not the question of how to get synchronized to UTC or TAI, but rather figuring out when exactly we receive each burst. This is critical since for instance, if the time sync is incorrect, the channel estimator could end up being fed *modulated unknown data* rather than the midamble!↩︎

a sum to check our work, call that a check-sum :p↩︎


Not all modulation schemes have the zero-ISI property that RRC-filtered ^{1} linear modulations have. Continuous-phase modulations (like GMSK, which we’ll be looking at) generally introduce inter-symbol interference: if your receiver recovers symbols by slicing-and-thresholding the received-and-filtered signal, it will have degraded performance – even if its timing is perfect.

This doesn’t prevent us from making high-performance (approaching optimal) receivers for GMSK. If the transmitter has a direct line-of-sight to the receiver and there’s not much else in the physical environment to allow for alternate paths, the channel won’t have much dispersive effect. This lets us approximate the channel as a non-frequency-selective attenuation followed by additive white Gaussian noise. In this case, you can use the Laurent decomposition of the GMSK amplitude-domain waveform to make a more complex receiver that’s quite close to optimal.

This line-of-sight case is common in aerospace applications: if an airplane/satellite is transmitting a signal to an airplane/satellite or to a ground station, there usually is a quite good line of sight between the two – with not many radio-reflective objects in between that could create alternate paths. The received signal will look very much like the transmitted signal, only much weaker.

If your transmitter and receiver antennae aren’t in the sky or in space, they’re probably surrounded by objects that can reflect radio waves. In fact, they might not even have *any* line of sight to each other at all! You can use your cell phone anywhere with service, not just anywhere you have a cellular base station within line of sight.

If you’ve ever spoken loudly in a quiet tunnel/cave/parking garage, you hear echoes – replicas of your voice, except delayed and attenuated. A similar phenomenon occurs when there’s multiple paths the radio waves can take from the transmitter to the receiver. Think of the channel as a tapped delay line: the receiver receives multiple copies of the signal superimposed on each other, with each copy delayed by the corresponding path delay and attenuated by the corresponding path loss.

Imagine an extreme case: sending symbols at \(1\) symbol per second, and leaving the channel silent for \(1\) second between each symbol. Let’s say we have four fixed paths with equal attenuation, with delays \(50\)ms, \(100\)ms, \(150\)ms, and \(210\)ms. The difference between the shortest path (the path that will start contributing its effect at the receiver the earliest) and the longest path (the path that takes the longest time to start contributing its effect at the receiver) is known as the “delay spread”, and here it’s \(210-50\)ms\(=160\)ms. Initially, the receiver gets something very much non-constant: as each of the paths “fills up”, its contribution appears at the receiver, but this transient only lasts for the first \(160\)ms of the symbol. After that \(160\)ms, the channel reaches equilibrium, and for the remaining \(1000\)ms\(-160\)ms\(=840\)ms, the receiver receives a constant signal. If the receiver ignores the first \(160\)ms of each symbol, it can ignore the multipath altogether!
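This thought experiment is easy to simulate, sampling at 1 sample per millisecond with four unit-gain paths:

```python
# One second-long constant symbol, sampled at 1 sample per millisecond,
# sent through four equal-gain paths with delays of 50/100/150/210 ms.
delays_ms = [50, 100, 150, 210]
tx = [1.0] * 1000

rx = [0.0] * (len(tx) + max(delays_ms))
for d in delays_ms:
    for n, v in enumerate(tx):
        rx[n + d] += v

assert rx[49] == 0.0                        # nothing has arrived yet
assert 0.0 < rx[100] < 4.0                  # transient: paths filling up
assert all(v == 4.0 for v in rx[210:1000])  # equilibrium after 160 ms
print(rx[60], rx[110], rx[160], rx[300])    # 1.0 2.0 3.0 4.0
```

The receiver sees a staircase during the first 160 ms after the first arrival, then a flat line, exactly as the prose describes.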

Note that the *absolute delay* of the paths does impact the latency of the system, but it doesn’t impact how the channel corrupts the signal. You could imagine the same system, except that there’s 3,000,000 kilometers of ideal (doesn’t attenuate or change the signal, just delays it) coaxial cable between the transmitter and the transmit antenna. That’s gonna add 10 seconds^{2} of delay, but it won’t alter the received signal at all.

This dynamic (symbol time much greater than delay spread) is why analog voice modulation doesn’t need fancy signal processing to cope with multipath. The limit of human hearing is 20 kilohertz, and \(c/(20kHz)=15\) kilometers, which is pretty big – paths with multiple kilometers of additional distance are gonna be pretty attenuated and won’t be very significant to the receiver^{3}.

The higher the data rate compared to the delay spread, the less you can ignore multipath. Increase the symbol rate to GSM’s \(270\) kilosymbols per second, and we get \(c/(270kHz)=1\) kilometer. Paths with hundreds of meters of additional distance aren’t negligible in lots of circumstances!

A high-performance demodulator has to function^{4} despite this channel-induced ISI. It turns out that the same mechanism that needs to handle the channel-induced ISI (which changes based on the physical arrangement of the scatterers in the environment, and is estimated by the receiver, often using known symbols) can also handle the modulation-induced ISI as well.

The “Gaussian” in “GMSK” isn’t a filter that gets applied to the *time-domain* samples. Rather, it’s a filter that gets applied in the *frequency-domain*, and this frequency-domain signal gets used to feed an oscillator – and it’s that oscillator that generates the time-domain baseband signal.
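Here’s a rough pure-Python sketch of that structure: filter the *frequency* drive with a Gaussian, then integrate it into phase inside an oscillator. The BT value, pulse truncation, and normalization below are illustrative choices, not bit-exact GSM:

```python
import math

sps = 8                                      # samples per symbol
bt = 0.3                                     # illustrative BT product

# Truncated, normalized Gaussian impulse response (4 symbols wide).
sigma = math.sqrt(math.log(2)) / (2 * math.pi * bt)
taps = [math.exp(-0.5 * ((k / sps) / sigma) ** 2)
        for k in range(-2 * sps, 2 * sps + 1)]
total = sum(taps)
gauss = [t / total for t in taps]

bits = [1, 1, -1, 1, -1, -1, 1, 1]
nrz = [b for b in bits for _ in range(sps)]  # NRZ *frequency* drive

# Gaussian-filter the frequency signal (simple FIR, edges zero-padded)...
freq = [sum(g * nrz[n - (k - 2 * sps)]
            for k, g in enumerate(gauss)
            if 0 <= n - (k - 2 * sps) < len(nrz))
        for n in range(len(nrz))]

# ...then integrate into phase: the oscillator turns the frequency
# signal into a constant-envelope baseband waveform.
phase, baseband = 0.0, []
for f in freq:
    phase += (math.pi / 2) * f / sps         # MSK-style pi/2 per symbol
    baseband.append(complex(math.cos(phase), math.sin(phase)))

# All the pulse shaping lives in the phase; the envelope stays constant.
assert all(abs(abs(s) - 1.0) < 1e-9 for s in baseband)
```

The constant-envelope check at the end is the whole point: no matter how much the Gaussian smears the frequency drive, the amplitude never wavers.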

The following 3 diagrams are from the wonderful Chapter 2 of Volume 3 of the JPL DESCANSO Book Series.

The Laurent decomposition tells us that the Gaussian-shaped GMSK frequency-domain pulse, after it gets digested by an oscillator, ends up being equivalent to two time-domain pulses (there are more but they are truly negligible), \(C_0\) (the big one) and \(C_1\) (the small one):

The first Laurent pulse is excited by a function of the current data symbol^{5}. So far, so good. A suboptimal receiver can pretend that a GMSK waveform is only made of \(C_0\) Laurent pulses. If you ignore the \(C_1\) pulse, this reduces GMSK to MSK. **MSK is not a linear modulation, and has nonzero ISI**: the amplitude-domain pulse doesn’t have the zero-ISI property that RRC has.

However, if we have a good phase estimate, we can separate the MSK signal into in-phase (\(I\)) and quadrature (\(Q\)) signals. MSK^{6} has a wonderful property once we’ve decomposed it this way: The “useful channel” alternates between \(I\) and \(Q\) for every symbol and contains no ISI, and the “other channel” (which alternates between \(Q\) and \(I\)) contains all the ISI.

To phrase it another way, on even symbols, the information needed to estimate the symbol is all in \(I\), and the ISI is all in \(Q\), and on odd symbols, the information needed to estimate the symbol is all in \(Q\), and the ISI is all in \(I\). Looking at \(I\) and \(Q\) separately eliminates the ISI, and this lets us make a receiver that looks much like a linear modulation receiver (integrate-and-dumps, comparators, etc) with close to ideal performance.

Stuff gets more interesting if you don’t ignore the second Laurent pulse. What’s that one excited by? Well, it’s a function of the current bit, the previous bit, **and the bit before that**! There’s even a little shift register on the bottom left!

Incidentally, that shift register isn’t just theoretical. If you implement a GMSK modulator with precomputed waveforms in a ROM (as opposed to using a Gaussian filter / integrator / NCO), there’s gonna be a shift register that looks much like that, which helps you index the ROM and postprocess the ROM output. I implemented a GMSK modulator in Verilog that uses precomputed waveforms, with the paper “Efficient implementation of an I-Q GMSK modulator” (doi://10.1109/82.481470 by Alfredo Linz and Alan Hendrickson) as a guide.

There’s 16 possible waveforms you need to be able to generate (8 possible values of the shift register; I and Q for each), but the structure of the modulation lets you cut down on ROM required: if you can time-reverse (index the ROM backwards) and/or sign-reverse (flip the sign of the samples coming *out* of the ROM), you can store just 4 basic curves in the ROM and generate all 16 waveforms that way.

Unlike with RRC, there’s no magic filter that nulls out GMSK’s ISI/memory. Unlike with MSK, separating \(I\) and \(Q\) doesn’t neatly separate the data and the ISI.

Every time a demodulator receives a new sample (or receives \(n\) new samples if there are \(n\) samples per symbol), it needs to decide what symbol was most likely to generate that sample. If it didn’t do something like that, it wouldn’t be much of a demodulator.

If the modulator has no memory, this task is pretty simple: we look at the sample values **each possible** symbol would have generated, and we compare each of those gold-standard values against the value we *actually received*. Which symbol was most likely to have been sent? The symbol whose value is the closest to what was actually received.

How accurate is this? Depends on how many possible symbols there are! Increase the number of possible symbols (“bits per symbol”, “modulation order”), and this decreases the amplitude of noise necessary to sufficiently shift the received sample such that the closest symbol is incorrect.
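For the memoryless case, that decision rule is nearly a one-liner. The 4-PAM levels below are chosen purely for illustration:

```python
# Memoryless decision: pick the constellation point closest to the sample.
constellation = [-3, -1, 1, 3]               # e.g. 4-level PAM

def decide(sample):
    return min(constellation, key=lambda s: abs(sample - s))

received = [0.8, -2.6, 3.4, -0.2]
print([decide(r) for r in received])         # [1, -3, 3, -1]
```

More levels squeeze the points closer together for a fixed transmit power, so less noise is needed to push a sample across a decision boundary.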

If the modulator has memory, this task is more complicated. The signal that the modulator generates for a symbol doesn’t just depend on the current symbol, but on a certain number of past symbols as well.

If the demodulator wants to extract the most possible information from the received signal, it needs to read the modulator’s mind.

Assume the demodulator has access to a perfect mind-reading channel: we can see into all of the modulator’s state – except for what’s affected by the current symbol. The latter proviso prevents the demodulator’s task from becoming trivial. Via the mind-reading channel, the demodulator knows the last two bits the modulator sent: call them \(b_1\) and \(b_2\). There’s a standard assumption that the transmitted signal is a random bitstream, so knowing \(b_1\) and \(b_2\) gives the demodulator strictly zero information about \(b_3\).

The demodulator actually has to estimate \(b_3\) from the noisy received signal, like usual. However, that task is actually solvable now! We have a local copy of a GMSK modulator, and we generate two candidate signals: one with the sequence \((b_1, b_2, 0)\), and one with the sequence \((b_1, b_2, 1)\). If what was actually received is closer to the former, we decide a \(0\) was sent, if the latter is closer, we decide a \(1\) was sent.

You see where this is going! We estimated a value for \(b_3\) – call it \({b\_estimated}_3\) – by comparing the two possible alternatives. Now, when the modulator sends \(b_4\), we don’t need the mind-reading channel anymore! **We already have our best estimate for what \(b_3\) was, and we can use that \({b\_estimated}_3\) to find \(b_4\)!** Indeed, we use our local GMSK modulator to modulate \((b_2, {b\_estimated}_3, 0)\), and \((b_2, {b\_estimated}_3, 1)\) and use that to determine what \(b_4\) likely is.
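Here’s a toy version of that decision-feedback scheme. The “modulator” is a made-up memory-3 mapping (a little FIR filter standing in for GMSK), not the real Laurent machinery:

```python
import random

random.seed(2)

# Made-up memory-3 "modulator": the output depends on the current bit
# and the two bits before it (taps chosen for illustration; bits +/-1).
def mod_sample(b2, b1, b0):                  # b0 is the current bit
    return b0 + 0.4 * b1 + 0.2 * b2

bits = [random.choice([-1, 1]) for _ in range(200)]
tx = [mod_sample(bits[n - 2], bits[n - 1], bits[n])
      for n in range(2, len(bits))]
rx = [s + random.gauss(0.0, 0.2) for s in tx]

# Decision feedback: our past *decisions* stand in for the mind-reading
# channel, and we compare the two candidate samples for the current bit.
decided = [bits[0], bits[1]]                 # pretend the first two are known
for r in rx:
    b2, b1 = decided[-2], decided[-1]
    cand = {b0: mod_sample(b2, b1, b0) for b0 in (-1, 1)}
    decided.append(min(cand, key=lambda b0: abs(r - cand[b0])))

errors = sum(d != b for d, b in zip(decided, bits))
print(errors)  # almost certainly 0 at this mild noise level -- but a
               # single early slip would have propagated down the chain
```

Crank the noise up and the failure mode described next appears: one wrong decision poisons the references for the following bits.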

Unfortunately, eschewing the mind-reading channel isn’t free. The clunky \({b\_estimated}_3\) notation foreshadowed that \({b\_estimated}_3\) and \(b_3\) aren’t guaranteed to be equal. \({b\_estimated}_3\) might be the best possible estimate we can make but it still can be incorrect!

If \({b\_estimated}_3 \neq b_3\) and we try and guess what \(b_4\) is by using \((b_2, {b\_estimated}_3, 0)\) and \((b_2, {b\_estimated}_3, 1)\) as references, we’re in for a world of hurt. The error with \({b\_estimated}_3\) is forgivable (there’s noise, errors happen), but using an incorrect value of \(b_3\) to estimate \(b_4\) **propagates that error into \({b\_estimated}_4\)**…which will propagate into \({b\_estimated}_5\), and so on.

We want to *average* out errors, not *propagate* them!

If we still had our mind-reading channel, we would know what the true value of \(b_3\) was (of course, only after we commit ourselves to \({b\_estimated}_3\), otherwise the game is trivial), and could use that to estimate \(b_4\), by using \((b_2, b_3, 0)\) and \((b_2, b_3, 1)\) for comparison against our received signal.

We’re at a loss here, because mind-reading channels don’t exist, but if we don’t use the mind-reading channel, our uncertain guesses can amplify errors.

It turns out we were almost on the right track. We can turn this error-amplification^{7} scheme into something truly magical (a sort of magic that actually exists) if we *delay our decisions*.

We have to make decisions on uncertain data. However, this doesn’t oblige us to make a decision for \(b_i\) *as soon as it is possible to make a better-than-chance decision* for \(b_i\)! If there’s useful data that arrives *after* we have committed to a decision on \(b_i\), we’re throwing that data away – at least when it comes to estimating \(b_i\).

In fact, if we want to do the best job we can, we’ll keep accumulating incoming data until the incoming data tells us *nothing* about \(b_i\). Only then will we make a decision for \(b_i\), since we’ve collected all the relevant data that could possibly be useful for its estimation.

But how do we add up all that information? What metrics get used to compare different possibilities? How will this series of selections estimate the sequence of symbols that most likely entered the modulator? And how do we avoid a combinatorial explosion?

- “Efficient implementation of an I-Q GMSK modulator” (doi://10.1109/82.481470 by Alfredo Linz and Alan Hendrickson)
- “Comparison of Demodulation Techniques for MSK” by Uwe Lambrette, Ralf Mehlan, and Heinrich Meyr
- “GMSK Demodulator Implementation for ESA Deep-Space Missions”, by Gunther M. A. Sessler; Ricard Abello; Nick James; Roberto Madde; Enrico Vassallo
- Chapter 2 of Volume 3 (“Bandwidth-Efficient Digital Modulation with Application to Deep-Space Communications”) of the JPL DESCANSO Book Series, by Marvin K. Simon

(with an RRC receive filter at the receiver)↩︎

ideal coaxial cable has a velocity factor of 1↩︎

unless you’re on shortwave/HF, where it *is* possible to get echoes since the ionosphere sometimes *does* give rise to paths with drastically different distances and without catastrophic attenuation↩︎

The equalization task with OFDM is greatly simplified: orthogonal frequency-domain subcarriers + circular prefixes create a circulant matrix. The receiver does a big FFT, and the properties of the circulant matrix mean the effect of a dispersive channel is limited to multiplying the output of each subcarrier by a complex coefficient. That complex coefficient is merely the amplitude/phase response of the channel, measured at that subcarrier’s frequency. In real-world systems you need a way to estimate those complex coefficients for each subcarrier (symbols with known/fixed values are useful for this), a way to adapt them as the channel changes over time, and a way to cope with Doppler.↩︎

This figure says “precoded” which means that if you want to get the same result, you need to put a differential encoder in front of the bitstream input; but using this diagram (instead of “Fig. 2-33” in the same chapter) more clearly demonstrates that GMSK has a 3-symbol memory.↩︎

for \(h=0.5\) full-response continuous-phase modulations more generally↩︎

This scheme actually works fine if most of the energy in the channel/modulator impulse response lives in the earliest coefficient; since the guesses will just…tend to be right most of the time! However, that’s not generally the case, RF channels are rarely this friendly, unless line of sight dominates. You can shorten an unfriendly channel by decomposing its impulse response into an all-pass filter and a minimum-phase filter (whose energy will indeed be front-loaded), but it probably won’t guarantee you a channel that lets you get away with avoiding a trellis altogether…↩︎