Here’s an interesting brainteaser I found. Suppose you are given $n>1$ red marbles and $n$ blue marbles, as well as two bags. You may distribute the marbles between the two bags in any way you like. After you are done placing the marbles, I will select a bag at random, and select a marble at random from within that bag. Maximize the probability of drawing a red marble by distributing the marbles appropriately.

I challenge you to solve this by brute force! Write down the probability function, take first order conditions, realize that this is an integer programming problem, scratch your head…

That was actually my first attempt. I quickly found that a little reasoning is much easier. Read on for my solution, after solving it yourself, of course! (By the way, it is fairly easy to guess the correct answer. Can you prove that it is correct?)

First, it’s easy to see that you can do better than putting equal allocations in both bags, or than leaving one of the bags empty. For example, by putting a single red marble in the first bag and the rest of the marbles in the second bag, the probability of choosing a red marble exceeds one half. In fact, any allocation in which both (nonempty) bags contain the same ratio of red marbles to blue marbles yields a probability of exactly one half, since the overall ratio is one half. The optimal allocation must therefore have both bags nonempty, with a different ratio of red marbles to blue marbles in each bag; and since the overall ratio is one half, the bag with the higher ratio must contain more red marbles than blue marbles. Since “first” and “second” are arbitrary labels, let’s suppose that the first bag is the one with more red marbles than blue marbles.

If the first bag contains no blue marbles, then it contains only red marbles. If it contains more than one red marble, we can improve the allocation by moving all but one of them to the other bag: the first bag still yields a red marble with certainty, while the second bag’s fraction of red marbles increases. Thus, if the first bag of the optimal allocation contains no blue marbles, it must contain exactly one red marble.

If the first bag contains at least one blue marble, then we can improve the allocation by moving one red marble and one blue marble from the first bag to the second. Removing one red and one blue marble raises the first bag’s ratio of red marbles to blue marbles, because we assumed the first bag has more red marbles than blue marbles (this is a simple arithmetic fact that I’m not proving: if $\frac{x}{y}>\frac{a}{b}$ then $\frac{a+x}{b+y}>\frac{a}{b}$, and removing marbles works in the mirror-image way). The second bag has fewer red marbles than blue marbles, because the first bag has more red than blue while the totals are equal; adding one red and one blue marble therefore raises its ratio of red marbles to blue marbles (the same inequality again). Since the transfer raises the chance of drawing a red marble from each bag, no allocation with a blue marble in the first bag can be optimal.

Putting this all together, we have shown that in the optimal allocation, one of the bags must have more red marbles than blue marbles; that this bag cannot contain any blue marbles; and that, containing only red marbles, it can contain at most one of them, hence exactly one. Thus, the optimal allocation is a single red marble in one of the bags, and all of the remaining marbles in the other.
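For completeness, the probability achieved by this allocation is

$\frac{1}{2}\cdot 1+\frac{1}{2}\cdot\frac{n-1}{2n-1}=\frac{3n-2}{4n-2},$

which equals $2/3$ for $n=2$ and approaches $3/4$ as $n$ grows.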

What do you think? Do you have an easier solution?


Suppose we estimate $\mathbb{P}_tf\left(X_{t+1}\right)$ under a model $\mathbb{P}_t$, where the true model is $\mathbb{Q}$. We would like to quantify $\mathbb{Q}d\left(\mathbb{P}_tf\left(X_{t+1}\right),f\left(X_{t+1}\right)\right)$. One naive way to do this is to use the empirical distribution $\hat{\mathbb{Q}}$. Note that if $\mathbb{P}_t=\mathbb{Q}$, then $\mathbb{P}_t$ is a more efficient estimator than $\hat{\mathbb{Q}}$. If not, there would be hardly any point to building models! Under mild assumptions, we know that $\hat{\mathbb{Q}}g\left(X_{t+1}\right)\to\mathbb{Q}g\left(X_{t+1}\right)$ almost surely. Thus, we can estimate our model error via $\hat{\mathbb{Q}}d\left(\mathbb{P}_tf\left(X_{t+1}\right),f\left(X_{t+1}\right)\right)$. For example, taking $d\left(x,y\right)=\left(x-y\right)\left(x-y\right)'$, we simply have the sample mean squared error, which decomposes into model covariance and bias covariance.
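As a minimal sketch of the scalar case (the variable names below are mine, not part of the derivation above): with $d\left(x,y\right)=\left(x-y\right)^2$, the estimate $\hat{\mathbb{Q}}d$ is just the sample mean squared error of the one-step-ahead predictions, and it splits into an error-variance term plus a squared-bias term.

% pred(t) is the model's prediction of f(X_{t+1}) made at time t;
% actual(t) is the realized value of f(X_{t+1})
err = actual - pred;       % prediction errors
mse = mean(err.^2);        % sample estimate of the model error
b   = mean(err);           % sample bias
v   = var(err, 1);         % error variance, normalized by the sample size
% decomposition: mse equals v + b^2, up to floating-point error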

In other words, uncertainty is present because our model doesn’t perfectly capture cross-sectional variation, and also because our time-series predictions are inaccurate. For example, in the case where we have uncorrelated time-series estimation error, we have the decomposition $\Sigma=\Sigma_{\mathrm{cs}}+\mathrm{diag}\left(e_1^2,\ldots,e_n^2\right)$. As the estimation errors increase, the implied sample correlation matrix $\mathrm{corr}\left(\Sigma\right)$ shrinks towards the identity matrix. That is, the presence of “idiosyncratic” time-series estimation error reduces the predictive power of correlations. This is fairly intuitive: even if our in-sample understanding of pairwise relationships is perfect, this doesn’t help us predict relationships tomorrow if we are clueless about where each individual series will end up.
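To see the shrinkage concretely, here is a minimal sketch with made-up numbers: two assets with cross-sectional correlation 0.8 and equal time-series estimation errors.

Sigma_cs = [1.0, 0.8; 0.8, 1.0];   % "perfect" cross-sectional covariance
e = [0.5; 0.5];                    % idiosyncratic time-series estimation errors
Sigma = Sigma_cs + diag(e.^2);     % total covariance including estimation error

d = sqrt(diag(Sigma));
impliedCorr = Sigma ./ (d * d');   % off-diagonal entries shrink from 0.80 to 0.64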

This decomposition can also be quite dangerous. Instead of shrinking sample correlations towards zero, one could find that by adding the sample bias covariance matrix, correlations are actually magnified. The mistake in applying the decomposition in this way is that the sample bias covariance matrix is not the actual bias covariance matrix, and does not account for its own uncertainty. It is possible to correct for this by a recursive procedure: the sample covariance matrix has estimated error, which has estimated error… and so on.

In practice, how precise should we be? In the context of portfolio optimization, it is perhaps better to be conservative with correlation estimates, so assuming away off-diagonal terms as in our example above may be sufficiently accurate to control for model risk.

Econometric theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping-pong balls for turtle’s eggs, and, for Chalifougnac vintage 1883, a can of turpentine.

Stefan Valavanis, 1959

The above quotation, lifted from Valavanis’s Econometrics, elegantly summarizes the limitations of applied econometrics, and of applied statistics more generally. Every practitioner should print it out as a reminder not to take theoretical results too seriously when working with actual data.


The difference between the countably and uncountably infinite is one of the simplest, and also one of the most important insights in analysis. It’s one of the few examples that I fall back on when I’m trying to convince somebody that math is “cool,” and all of the abstract talk about beauty and elegance fails (one of the other fallbacks is the Banach-Tarski paradox, and another is the central limit theorem). But of course, as quants, you all appreciate that one of the coolest things about math is the little collection of facts that you discover when thinking aimlessly on some random Sunday afternoon. I’d like to share one of these with you.

Let’s start with a very obvious claim: if you let me pick every digit in the decimal expansion of a number, then I can pick an irrational number. The proof? Pick the digits of pi. For example, I could name “one, four, one,” and so on, and the decimal expansion that I picked, 0.141592…, would converge to pi minus three, which is irrational.

Here’s the less obvious claim: if I let you pick every digit in a decimal expansion, then there’s no way you can pick a rational number, provided I cleverly tell you a single digit you’re not allowed to pick each time. For example, I could say something like “give me another digit, except it can’t be five,” or “give me another digit, except it can’t be three.” If I keep letting you pick digits in this way, then you’re sure to end up with an irrational number. Let’s prove this.

Start with the classic diagonalization proof that the irrationals are uncountable. Suppose the irrationals were countable. Then the irrationals between zero and one would be countable, so we can set

$\begin{aligned} n_1 &= 0.a_{11} a_{12} \cdots \\ n_2 &= 0.a_{21} a_{22} \cdots \\ &\ \vdots \\ n_i &= 0.a_{i1} a_{i2} \cdots \end{aligned}$

where $n_1, n_2, \ldots$ enumerate all of the irrational numbers between zero and one, and each $a_{ij}$ is a digit. One can show that the base of the number system doesn’t matter, but we won’t do that here. The main thing to prove is that you can construct a number $n = 0.a_1 a_2 a_3 \cdots$ such that $n\neq n_i$ for any $i$. You can do this by picking $a_i \neq a_{ii}$ for all $i$, and handling the corner cases in which two different digit sequences represent the same number, such as $0.999\cdots = 1$ (for example, by picking $a_i = 5$ unless $a_{ii} = 5$, in which case $a_i = 6$; the constructed expansion then never ends in all zeros or all nines).

But you already knew that. Here’s the new part.

What if we apply the same proof, but instead of trying to prove that the irrationals are uncountable, we try to prove that the rationals are uncountable? If we follow the above proof, we eventually hit a snag when we pick $a_i \neq a_{ii}$ for all $i$. Assuming we handle the corner cases in the right way, it is clear that this constructed $n$ is not on the list of all rational numbers. But this doesn’t mean that the rationals are uncountable (they are, of course, countable); it means that $n \in \mathbb{R} - \mathbb{Q}$!

So, using the same familiar proof that the irrationals are uncountable, we have proved that we can construct an irrational number simply by picking any decimal expansion such that the first digit is not some constant, the second digit is not some possibly different constant, and so on. When you combine this with the observation that the order of the rational numbers we used to construct our proof is completely arbitrary, you realize yet again just how much larger the set of irrationals is compared to the rationals.
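For concreteness (using a toy ordering of my own): if the enumeration of the rationals in $(0,1)$ happens to begin $0.5000\cdots$, $0.3333\cdots$, $0.2500\cdots$, then the forbidden digits are five in the first position, three in the second, zero in the third, and so on. Any expansion avoiding those digits differs from every rational on the list, and, once the corner cases are handled as above, must therefore be irrational.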

UPDATE: In the code snippet I originally provided, I used concatenation to build a running “total” of all data read so far. It is far better to pre-allocate, especially for large data files. I had left the pre-allocation code out because it obscures the main idea of the approach, but one reader pointed out that the original code performs rather badly in practice, so I have updated the code below to include pre-allocation.

Matlab has a very nice data import wizard that uses the function importdata. One immediate drawback is that the data import process stops without error as soon as non-numeric data is encountered. As your dataset gets large, it becomes increasingly difficult to determine whether all of your data was read correctly without a lot of manual checking. I usually use the function textscan combined with str2double to read all data as strings, and then automatically convert bad data to NaN values. This approach uses up an obscene amount of memory for large datasets. This article provides an alternative method.

Since textscan sets the end of file flag if it successfully reads the entire file, we merely need to check whether the flag is set; if it’s not, then there was an error, so we discard a line of data, and then continue importing with the next line. We continue repeating our attempts at textscan (and the subsequent checking) until we reach the end of the file. The following code snippet shows this method in action:

function [ total ] = robustTextScan( fileName, format, headerRows, lineEstimate, lineEstimateType, varargin )
%ROBUSTTEXTSCAN Read a delimited text file with textscan, skipping bad lines.
%   Any line that cannot be parsed with the supplied format string is
%   discarded, and scanning resumes on the following line. lineEstimate and
%   lineEstimateType are hints used only for pre-allocation; any additional
%   arguments are passed through to textscan.

fid = fopen(fileName);

% skip the header rows
for i = 1:headerRows
    fgets(fid);
end

% first attempt at scanning the whole file
temp = textscan(fid, format, varargin{:});

% if we haven't hit the EOF, it's because there was an error in the current
% line. otherwise, we're done
if ~feof(fid)

    total = cell( size(temp) );

    % pre-allocate as well as you can
    if ~isempty(lineEstimate) && iscell(lineEstimateType) && length(lineEstimateType) == length(total)
        for i = 1:length(total)
            total{i} = repmat(lineEstimateType{i}, lineEstimate, 1);
        end
    end

    lowerIndex = 1;

    while ~feof(fid)
        % find the shortest column; this is the one that hit the error
        minLength = min(cellfun( @(x) size(x,1), temp));

        if minLength > 0
            % keep only the complete rows, up to the minimum length
            temp = cellfun( @(x) x(1:minLength,:), temp, 'UniformOutput', 0);

            % append the complete rows to the total output
            upperIndex = lowerIndex + minLength - 1;
            for i = 1:length(temp)
                total{i}(lowerIndex:upperIndex) = temp{i};
            end
            lowerIndex = upperIndex + 1;
        end

        % discard the rest of the offending line and try again on the next one
        fgets(fid);

        % continue scanning
        temp = textscan(fid, format, varargin{:});
    end

    % append the final chunk to the total output
    minLength = min(cellfun( @(x) size(x,1), temp));
    upperIndex = lowerIndex + minLength - 1;
    for i = 1:length(temp)
        total{i}(lowerIndex:upperIndex) = temp{i}(1:minLength,:);

        % trim the pre-allocated output down to its final length
        total{i} = total{i}(1:upperIndex);
    end
else
    total = temp;
end

fclose(fid);

end
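As a hypothetical usage example (the file name, format string, and pre-allocation hints below are made up): read a file with one string column and two numeric columns, skip one header row, and pre-allocate roughly 10,000 rows.

% a string column pre-allocates with {''}, a numeric column with 0
data = robustTextScan('prices.csv', '%s %f %f', 1, 10000, {{''}, 0, 0}, 'Delimiter', ',');
dates  = data{1};   % cell array of strings
px     = data{2};   % double column
volume = data{3};   % double column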


This problem was asked by user NicoDean in the freenode #matlab channel (paraphrased and edited for clarity):

Suppose you have a string expression “sin( sin( cos( x ) exp (y ) ) )”. How do you determine the maximum nesting depth, defined as the largest number of parentheses that are open but not yet closed at any point in the string, in a vectorized way? In the above example, the maximum nesting depth is 3. Some other examples:

• ‘( ( ( ( ) ( ) ) ( ) )’  has a maximum nesting depth of 4
• ‘( ) ( ) ( )’ has a maximum nesting depth of 1
• ‘( ( ) )’ has a maximum nesting depth of 2

The question is how to compute this value without resorting to a large loop. There is actually a very simple solution:

count = max(cumsum(a == '(') - cumsum(a == ')'));


That’s it! The expression inside the max computes, for each index i, the number of open parentheses encountered up to index i minus the number of closed parentheses encountered up to index i. The maximum nesting depth is simply the maximum of this running count over all indices.
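For example, tracing the computation on the third example above (with the spaces dropped for brevity):

a = '(())';
cumsum(a == '(')                      % [1 2 2 2]
cumsum(a == ')')                      % [0 0 1 2]
cumsum(a == '(') - cumsum(a == ')')   % [1 2 1 0], whose maximum is the nesting depth, 2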


I introduced Jensen’s alpha in a previous article. One of my readers posted a very insightful comment, which I’ll reproduce here:

What information does Jensen’s alpha add that a simple test of significance would not? For example, if factor X always significantly predicts returns in the cross-section, even accounting for factor Y, then when would X have a negative alpha? More importantly, even if X has a negative alpha, why would you omit X from your model given that it accurately predicts returns?

The answer is that Jensen’s alpha measures strategy performance across time, whereas cross-sectional tests only test predictive power in a single period. Even if X and Y are great at predicting performance in each given period, it may be that both X and Y actually depend on a common factor Z, where Z fluctuates throughout time, but is the same for each asset in a single period. Then, strategies that bet on both X and Y are actually just betting on Z twice. A cross-sectional test would say that X and Y are both significant, whereas a Jensen’s alpha test would show that X and Y are essentially the same strategy. To illustrate this difference, consider the following example.
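To make the dependence concrete, here is one stylized setup (my own notation, not the commenter’s) in which this happens:

$r_{i,t+1} = Z_t\,x_{i,t} + Z_t\,y_{i,t} + \varepsilon_{i,t+1}.$

In each period the cross-sectional regression finds significant loadings on both $x$ and $y$, but the time series of those loadings is the single series $Z_t$; regressing either factor’s return stream on the other (plus a constant) then leaves essentially no alpha.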

Suppose, after much research, you find that companies that produce many gadgets and companies that produce many widgets tend to perform well. There is little correlation between the number of gadgets and the number of widgets produced by a given company, so both of these factors are highly significant in the cross-section. You choose to bet on both trends. You then find that betting on high-gadget companies tends to do well in exactly the periods when betting on high-widget companies does well, even though both bets are successful on average. It turns out that when the economy does well, companies that produce many gadgets and companies that produce many widgets both sell more, and thus have much higher returns than when the economy does poorly.

Note that gadget and widget production are still significant predictors of returns, but betting on gadgets and betting on widgets are both strategies that are essentially betting on the economy! Thus, in the portfolio construction process, it would be unwise to bet on both gadgets and widgets at the same time, since this would simply amount to a double bet on the economy.

At this point, you may wonder: why not account for the economy in the original factor model? The answer is that you should! However, it can be difficult, except in the simplest of cases, to determine what the underlying common factor is. More generally, if the common factor varies over time but is the same for every asset within a period, then a single-period cross-sectional regression has no cross-sectional variation with which to identify it, so it cannot be accounted for in the cross-sectional analysis.

I hope this helps!

Update: The following Matlab code snippet shows a very simple way to compute Jensen’s alpha for each factor.

% ols.m
function beta = ols(y, x)
% ordinary least squares: regress y on the columns of x
beta = (x' * x) \ (x' * y);
end

% r = matrix of returns
%   size(r) = [N, T]
%   where N = number of firms and T = number of periods
%
% f = matrix of prediction factors
%   size(f) = [N, M, T]
%   where M = number of prediction factors

T = size(r, 2);   % number of periods
M = size(f, 2);   % number of prediction factors

beta = zeros(T, M);
for t = 1:T
% run a cross-sectional regression each period
beta(t,:) = ols( r(:,t), f(:,:,t) );
end

% compute a jensen's alpha for each factor
alpha = zeros(M,1);
for m = 1:M
tmp = ols( beta(:,m), [ones(T,1), beta(:, setdiff(1:M, m) )] );
alpha(m) = tmp(1);
end


Recall that the strong form of the efficient markets hypothesis states that prices instantly adjust to reflect private information. This contrasts with the semi-strong hypothesis that prices instantly adjust to reflect public information, and the weak hypothesis that prices reflect all information contained in past prices. Although the strong form is generally considered too strong, some evidence does occasionally pop up to support it. The Wall Street Journal published an article about the recent insider-trading probe. Here are some interesting snippets:

Alleged insider trading can sometimes become too commonplace to help much… The tip proved correct, but there was one problem: AMD’s shares tumbled the following day, after the news came out… The AMD stock drop likely was because so many other investors also had spoken with Mr. Longoria, “so that, what happens is the news gets out … before the, quote, unquote, event,” according to the complaint.

This is precisely what the strong form of the EMH predicts: profitable insider information will get priced in just as quickly as public information.

Hmm….


The National Bureau of Economic Research (NBER) hosts a number of interesting conferences every year. The conferences are by invitation only, and provide a glimpse at some of the year’s best working papers across all subfields of economics. The conference materials are technically restricted to participants, but they are freely available via a “private” website that all graduate students, researchers, and other serious academics should bookmark immediately.

Enjoy!


It is usually easy to visualize an explicitly defined curve or set of the form $\{y\colon y=f(t),\,t\in [0,1]\}$. It is much more difficult to visualize an implicitly defined set such as $S = \{s\colon f(s)=0 \}$. Understanding $S$ analytically generally requires careful use of the implicit function theorem in all but the easiest of cases. Fortunately, it is possible to visualize such sets in Matlab, as long as the function $f$ is “not too complicated.”

For example, suppose we wish to visualize the set of points where $g(x,y)>0$ on the unit square $[0,1]\times[0,1]$. We can plot the set by computing the function $g(x,y)$ over all points on some grid, and then plotting only the gridpoints that satisfy the condition. The following Matlab code should make this clear:

[x,y] = meshgrid(0:.01:1, 0:.01:1);
condition = g(x, y) > 0; % assume that g is vectorized
plot(x(condition), y(condition), '.'); % plot the points that satisfy the condition


For example, to plot the points on the unit square where $y > x$, we can set $g(x,y) = y - x$:

[x,y] = meshgrid(0:.01:1, 0:.01:1);
condition = y - x > 0;
plot(x(condition), y(condition), '.'); % plot the points that satisfy the condition


which gives the following output:

To add the additional restriction that $x > 0.1$ and either $y< 0.2$ or $y > 0.8$, we code:

[x,y] = meshgrid(0:.01:1, 0:.01:1);
condition = (y - x > 0) & (x > 0.1) & ( y < 0.2 | y > 0.8);
plot(x(condition), y(condition), '.');
axis([0,1,0,1]);


which gives the output:

If $g(x,y,z)$ is a vectorized function for three dimensions, then use code of the form:

[x,y,z] = meshgrid(0:.01:1, 0:.01:1, 0:0.01:1);
condition = g(x, y, z) > 0; % assume that g is vectorized
plot3(x(condition), y(condition), z(condition), '.'); % plot the points that satisfy the condition


For example, suppose we only want the points where $z > x + y$. Then, we code:

[x,y,z] = meshgrid(0:.05:1, 0:.05:1, 0:0.05:1);
condition = z > x + y;
plot3(x(condition), y(condition), z(condition), '.');
axis([0,1,0,1,0,1]);


which gives output:

Simple variations on this type of code will allow you to visualize more complicated sets.
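For instance (a sketch of my own, approximating the zero set itself rather than a region): to visualize $S=\{(x,y)\colon x^2+y^2-1=0 \}$, plot the grid points where $|g(x,y)|$ falls below a small tolerance:

[x,y] = meshgrid(-1.5:.01:1.5, -1.5:.01:1.5);
condition = abs(x.^2 + y.^2 - 1) < 0.01; % a thin band around the zero set
plot(x(condition), y(condition), '.');
axis equal;

The plotted points form a thin ring approximating the unit circle.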
