Hi All
I’m running into trouble constructing a good practical example for an upcoming webinar.
I know how to construct an “Ivory Tower” example to illustrate the technique in question; however, I’d like to anchor the problem to some kind of real-world problem to help motivate the methodology
Here’s the background. Please let me know if any of you have encountered something similar to this in real life:
1. Assume that you have a series of data sets
2. You need to fit a regression model to each of these data sets
3. All of your regression models share a common variable
4. You need to constrain the regressions such that the coefficient on that common variable is identical across all of the models
(Formally, if you’re working with linear regression, this gets referred to as a “common slope” problem)
I’m hoping that someone has a nice, practical applied example…
Anyone ever run into something similar in Physics or Chemical Engineering?
I’m guessing that there might be some interesting examples estimating parameters for “Equation of State” models; however, beggars can’t be choosers…
Also, for what it's worth, I'm planning on solving this as a pure optimization problem, so I'd like to steer away from mixed-effects models...
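To make the setup concrete (made-up numbers, and just the linear “common slope” special case rather than the general optimization formulation), here is the kind of thing I mean in MATLAB: each data set gets its own intercept, but all of them share a single slope. Stacking everything into one design matrix enforces the identical-slope constraint by construction.
% Two made-up data sets that share a slope of 3 but have different intercepts
x1 = (1:20)';  y1 = 2 + 3*x1 + randn(20,1);
x2 = (1:15)';  y2 = 5 + 3*x2 + randn(15,1);
% One stacked design matrix: columns = [intercept 1, intercept 2, common slope]
X = [ones(20,1) zeros(20,1) x1;
     zeros(15,1) ones(15,1) x2];
y = [y1; y2];
beta = X\y   % beta(1), beta(2) = separate intercepts; beta(3) = common slope estimate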
Applied Statistics Question
#2
Posted 2011-February-23, 10:57
Hi Hrothgar
I'm an economist, not a physicist or chemical engineer, but this may be the sort of thing you have in mind:
Suppose you have time series data for a number of different countries on GDP growth, plus a number of potential drivers (not sure how complicated you want to make this, but they could include interest rates, exchange rates, government spending or whatever). In the current economic climate you might well wonder about the impact of oil prices on GDP growth, and want to test whether the impact of world oil prices on growth was the same in each country....
#3
Posted 2011-February-23, 11:40
You would presumably prefer an example where it is adequate to use multiple linked regression models rather than pooling all the data and applying a single mixed effect model.
One situation in medical statistics where this (applying multiple linked models) is routinely done is in stratified survival models, in which a drug is assumed to have the same effect on the hazard in each stratum, but the baseline hazards may differ between strata.
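In symbols, a stratified proportional-hazards model looks roughly like h_s(t | x) = h_{0,s}(t) * exp(beta * x): each stratum s gets its own baseline hazard h_{0,s}(t), but the treatment coefficient beta is shared across strata, so beta plays the role of the "common slope" on the log-hazard scale.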
Another example could be from gene expression modeling, when one wants to integrate data from different platforms. Say you have some data on gene expression profiles from animals under control conditions and some from animals subject to some kind of treatment, and you want to know which transcripts are affected by the treatment. Since some profiles come from microarrays while others come from RNAseq, you can't pool the data, as the residual variance is modeled in different ways.
Maybe an example from sports events (my new job!) would be more suitable for a broad audience, as the problems there are easier to grasp for non-specialists. Say you want to estimate the general effect of acclimatization factors such as jet lag, and you have data from a variety of sports (tennis, soccer, cricket). Since the response functions are completely different for different sports (some with binary win/lose outcomes, some with count data such as goals scored, some with continuous completion-time data, etc.), there can be no question of pooling the data. But there may be a universal concept of a baseline performance for a player, which could be scaled so that it is N(0,1) across players in any field, and one could then hypothesize that the offset caused by, e.g., having crossed five time zones 24 hours prior to the event would be a constant regardless of sport.
This is a little artificial; in practice one would probably prefer a slightly weaker assumption, say that the offset is a constant for each sport, and one wants to estimate the distribution of this offset across sports.
The world would be such a happy place, if only everyone played Acol :) --- TramTicket
#4
Posted 2011-February-23, 11:54
helene_t, on 2011-February-23, 11:40, said:
Another example could be from gene expression modeling, when one wants to integrate data from different platforms. Say you have some data on gene expression profiles from animals under control conditions and some from animals subject to some kind of treatment, and you want to know which transcripts are affected by the treatment. Since some profiles come from microarrays while others come from RNAseq, you can't pool the data, as the residual variance is modeled in different ways.
This one is near perfect...
Thanks!
Alderaan delenda est
#5
Posted 2011-March-07, 14:06
In case anyone cares, I decided to go with a tumor growth model presented in
Modeling of tumor growth and anticancer effects of combination therapy
by Gilbert Koch, Antje Walz, Gezim Lahu, and Johannes Schropp
Koch presents a model of tumor growth as a function of time with
1. A set of variables that are dependent on the specific drugs being used
2. Another set of variables that are independent of the drugs being used
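The attraction is that the drug-independent parameters are shared across all of the data sets while the drug-dependent parameters are estimated separately, which is exactly the structure I was after. As a toy illustration of the fitting idea only (NOT the actual Koch model; made-up data and a deliberately over-simplified exponential growth curve), a joint fit with a shared growth rate and per-drug kill rates might look something like this:
% Toy illustration only -- not the Koch et al. model
% Two made-up tumor growth curves, one per drug, sharing a common growth
% rate lambda but each with its own drug-effect parameter k
t  = (0:10)';
w1 = 100*exp((0.20 - 0.05)*t) + 5*randn(size(t));   % "drug 1" data (made up)
w2 = 100*exp((0.20 - 0.12)*t) + 5*randn(size(t));   % "drug 2" data (made up)
% p = [lambda, k1, k2]; minimize the pooled sum of squared errors
model = @(lambda,k) 100*exp((lambda - k)*t);
obj   = @(p) sum((w1 - model(p(1),p(2))).^2) + sum((w2 - model(p(1),p(3))).^2);
pHat  = fminsearch(obj, [0.1 0.01 0.01])   % pHat(1) = shared growth rate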
Alderaan delenda est
#6
Posted 2011-March-09, 08:23
Yes, I care. It's good to see this sort of discussion even if my knowledge of the area is scant. If I take an interest in an applied area of mathematics where the events are subject to deterministic rules I am usually pretty sure-footed, or at least I believe that I am. With statistical issues, I have learned to be cautious. Which makes it interesting.
Ken
#7
Posted 2011-March-09, 12:14
kenberg, on 2011-March-09, 08:23, said:
Yes, I care. It's good to see this sort of discussion even if my knowledge of the area is scant.
Here's another example from the same webinar that you might find interesting
Regression analysis is typically based on Ordinary Least Squares (OLS)
OLS makes a number of assumptions about the data.
If those assumptions get violated, OLS can misfire quite badly...
One such assumption concerns the way in which you measure your residuals (the distance between your predicted value and the observed value). OLS measures that distance along the axis of the dependent variable only. This is equivalent to assuming that all of your independent variables are measured perfectly accurately. (There's no random noise associated with any of your independent variables.)
Here's a fun little experiment that you can do (Well, the math geeks probably find it amusing)
1. Create a series of data points arranged on a plane
2. Add gaussian noise to X, Y, and Z to create a noisy data set
3. Use Ordinary Least Squares to fit a plane to the data
4. Calculate the sum of the squared errors between the "clean" Z value and the predicted Z value based on noisy X and noisy Y
5. Use Principal Component Analysis to fit a second plane to the data (the first two coordinates define the plane, the third gives a set of residuals orthogonal to the plane)
6. Calculate the sum of the squared errors between predicted and actual using the Orthogonal Regression
7. Your error term for the orthogonal regression is going to be much smaller
Alderaan delenda est
#9
Posted 2011-March-11, 12:52
Here's some MATLAB code (in case you have access to the program)
%% Generate a set of data points from a plane
clear all
clc
[CleanX, CleanY] = meshgrid(1:10);
CleanX = reshape(CleanX,100,1);
CleanY = reshape(CleanY,100,1);
CleanZ = 3*CleanX + 4*CleanY;
%% Add noise vectors to all three dimensions
NoisyX = CleanX + randn(100,1);
NoisyY = CleanY + randn(100,1);
NoisyZ = CleanZ + randn(100,1);
% create a scatter plot
scatter3(NoisyX,NoisyY,NoisyZ, '.k')
%% Fit a plane to the data
[foo, GoF] = fit([NoisyX NoisyY],NoisyZ, 'poly11')
% superimpose the plane on the scatter plot
hold on
h1 = plot(foo)
set( h1, 'FaceColor', 'g' )
%% Calculate the sum of the Squared Errors in the Z direction (OLS fit)
% Calculate residuals
resid1 = CleanZ - foo(NoisyX, NoisyY);
% Square residuals
resid1_sqrd = resid1.*resid1;
% Take the sum of the squared residuals
SSerr1 = sum(resid1_sqrd)
%% Use Principal Component Analysis to perform an Orthogonal Regression
% PCA is based on centering and rotation.
% PCA rotates the data such that the dimension with the greatest amount of
% variance is parallel to the X axis. The dimension with the second
% largest amount of variance will be parallel to the Y axis. This operation
% defines a plane. The direction with the third largest variance will be
% parallel to the Z axis. This dimension defines a set of residuals which
% are at right angles to the XY plane.
[coeff,score,roots] = princomp([NoisyX NoisyY NoisyZ]);
basis = coeff(:,1:2)
normal = coeff(:,3)
pctExplained = roots' ./ sum(roots)
% Translate the output from PCA back to the original coordinate system
[n,p] = size([NoisyX NoisyY NoisyZ]);
meanNoisy = mean([NoisyX NoisyY NoisyZ],1);
Predicted = repmat(meanNoisy,n,1) + score(:,1:2)*coeff(:,1:2)';
%% Generate a fit object that represents the output from the Orthogonal Regression
[foo2, Gof2] = fit([Predicted(:,1) Predicted(:,2)], Predicted(:,3), 'poly11')
h2 = plot(foo2)
set( h2, 'FaceColor', 'r' )
%% Calculate the sum of the squared residuals for the Orthogonal Regression
% Calculate residuals
resid2 = CleanZ - Predicted(:,3);
% Square residuals
resid2_sqrd = resid2.*resid2;
% Take the sum of the squared residuals
SSerr2 = sum(resid2_sqrd)
%% Generate a new Scatter Plot
figure
scatter3(CleanX, CleanY, CleanZ)
hold on
plot(foo2)
Alderaan delenda est