GAMS [ Home | Support | Sales | Solvers | Documentation | Model Libraries | Search | Contact Us ]

Cross Validation
GUSS formulation in GAMS

Cross validation is a statistical/machine learning technique that aims to evaluate the generalizability of a classifier (or other decision) process. It does this by setting aside a portion of the data for testing, and uses the remaining data entries to produce the classifier. The testing data is subsequently used to evaluate how well the classifier works. Cross validation performs this whole process a number of times in order to estimate the true power of the classifier.

Ten-fold cross validation is a special case, where the original data is split into ten pieces, and cross validation is performed using each of these ten pieces as the testing set. Thus, the training process is performed ten times, each of which uses the data obtained by deleting the testing set from the whole dataset. We show below how to carry this out using the Gather-Update-Solve-Scatter (GUSS) facility in GAMS.

A paper with the title "GUSS: Solving Collections of Data Related Models within GAMS" that contains two additional application examples for GUSS is available here.

The GUSS documentation can be found here.

The following example compares the two formulations for a feature-selection model under cross-validation using data files a_data.inc and b_data.inc. The actual source code for both of these GAMS formulations is available here.

Original GAMS formulation (without the GAMS/DEA interface):
$title Ten-fold cross validation example
$eolcom !

$setglobal num_folds 10

set a set for category 1         /1*1505/
    b set for category 2         /1*957/
    o observations               /1*14/
    p folds to perform           /1*%num_folds%/
    f maximum features to select /1*10/

* Read in the data from the data files
parameter a_data(a, o) /
$offlisting
$include "a_data.inc"
$onlisting
/;

parameter b_data(b, o) /
$offlisting
$include "b_data.inc"
$onlisting
/;

set a_test(p,a), b_test(p,b) testing sets
    a_trai(a),   b_trai(b)   training sets;

* Define problem
scalar w_tol /1/
       features /6/;

positive variables a_err(a), sla(a)
                   b_err(b), slb(b);

variables c,
          weight(o),
          gamma;

binary variable y(o);


equations w_def1(o),
          w_def2(o),
          y_def,
          c_def,
          a_def(a),
          b_def(b);

w_def1(o)..
        weight(o) =l= w_tol*y(o);

w_def2(o)..
        weight(o) =g= -w_tol*y(o);

y_def..
        sum(o, y(o)) =e= features;

c_def..
        c =e= sum(a, a_err(a)) + sum(b, b_err(b));

a_def(a)..
        -sum(o, a_data(a, o)*weight(o)) + gamma + 1 =l= a_err(a) + sla(a);

b_def(b)..
         sum(o, b_data(b, o)*weight(o)) - gamma + 1 =l= b_err(b) + slb(b);


model train /all/;

$batinclude gentestset.inc "p,a" "p,b"

set headers report / modelstat, solvestat, objval /;
parameter rep(p,headers);
train.optfile = 0;
option limrow=0, limcol=0, solprint=silent, mip=xpress,
       solvelink=%Solvelink.LoadLibrary%, optcr=0, optca=0;

$echo loadmipsol=1 > xpress.opt

loop(p,
  a_err.up(a) = inf; a_err.up(a)$a_test(p,a) = 0;
  b_err.up(b) = inf; b_err.fx(b)$b_test(p,b) = 0;
  sla.fx(a) = 0; sla.up(a)$a_test(p,a) = inf;
  slb.fx(b) = 0; slb.up(b)$b_test(p,b) = inf;
  solve train using mip minimizing c;
  train.optfile = 1; ! use mipstart for the second run
  rep(p,'modelstat') = train.modelstat;
  rep(p,'solvestat') = train.solvestat;
  rep(p,'objval') = train.objval;
);
display rep;

Options file for the original formulation: xpress.opt
loadmipsol=1

The batinclude file gentestset.inc gives instructions for generating the testing sets. It produces a_test and b_test that detail which equations are left out on solve p.

The actual model is set up to include all the data points in the equations a_def and b_def. To delete the equations that correspond to the test set, we introduce nonnegative slack variables into all the equations. We then set the upper bounds of the slack variables to zero in equations corresponding to the training set, and to infinity in equations corresponding to the testing set. At the same time we fix the error measures a_err and b_err belonging to the testing set by setting their upper bounds to zero. Thus the testing set equations are always satisfiable by choice of the slack variables alone - essentially they are discarded from the model as required. An alternative formulation could "include" the data equations that you need in each scenario, but the update from one scenario to the next in the defining data is much larger.

Cross validation formulated using GUSS: This model essentially mimics what the standard model does, but the implementation of the solver loop behind the scenes is much more efficient, and the consquences are that are clear to see if you execute both model runs. The changes are in the last 40 lines of the GAMS code.
$title Ten-fold cross validation example
$eolcom !

$setglobal num_folds 10

set a set for category 1         /1*1505/
    b set for category 2         /1*957/
    o observations               /1*14/
    p folds to perform           /1*%num_folds%/
    f maximum features to select /1*10/

* Read in the data from the data files
parameter a_data(a, o) /
$offlisting
$include "a_data.inc"
$onlisting
/;

parameter b_data(b, o) /
$offlisting
$include "b_data.inc"
$onlisting
/;

set a_test(p,a), b_test(p,b) testing sets
    a_trai(a),   b_trai(b)   training sets;

* Define problem
scalar w_tol /1/
       features /6/;

positive variables a_err(a), sla(a)
                   b_err(b), slb(b);

variables c,
          weight(o),
          gamma;

binary variable y(o);


equations w_def1(o),
          w_def2(o),
          y_def,
          c_def,
          a_def(a),
          b_def(b);

w_def1(o)..
        weight(o) =l= w_tol*y(o);

w_def2(o)..
        weight(o) =g= -w_tol*y(o);

y_def..
        sum(o, y(o)) =e= features;

c_def..
        c =e= sum(a, a_err(a)) + sum(b, b_err(b));

a_def(a)..
        -sum(o, a_data(a, o)*weight(o)) + gamma + 1 =l= a_err(a) + sla(a);

b_def(b)..
         sum(o, b_data(b, o)*weight(o)) - gamma + 1 =l= b_err(b) + slb(b);


model train /all/;
train.optfile = 1;

$batinclude gentestset.inc "p,a" "p,b"

parameter wval(p,o), gval(p);

set headers report / modelstat, solvestat, objval /;
parameter
    scenrep(p,headers)
    scopt(*) / SkipBaseCase 1, Optfile 1, LogOption 2 /;

set dict / p.     scenario.''
           scopt. opt.     scenrep
           a_err. upper.   aupper
           b_err. upper.   bupper
           sla.   upper.   afree
           slb.   upper.   bfree
           weight.level.   wval
           gamma. level.   gval /

$echo loadmipsol=1 > xpress.opt

Parameter aupper(p,a), bupper(p,b), afree(p,a), bfree(p,b);

aupper(p,a)$(not a_test(p,a)) = inf;
bupper(p,b)$(not b_test(p,b)) = inf;

afree(p,a)$a_test(p,a) = inf;
bfree(p,b)$b_test(p,b) = inf;

option mip=xpress, optcr=0, optca=0;
solve train using mip minimizing c scenario dict;
display scenrep, gval;

Firstly, parameters aupper, bupper, afree and bfree are used to set the bounds on the error and slack variables in the testing set equations respectively. The setting of the upper bounds are governed by the syntax shown in the controlling set dict. Furthermore, the output of the classifier (w,gamma) for each fold of the cross validation uses the dict set to place results into the parameters wval and gval respectively. Finally, the GUSS options are used to guarantee that the subsequent solves are instructed to process solver options (Optfile 1) which instruct the solver to use the previous solution to start the branch-and-cut process (loadmipsol=1).

The complete data and model files for this example are found in (galaxy zip archive). The data and model for a second instance based on the Wisconsin Diagnostic Breast Cancer Database is downloadable as (wdbc zip archive).

Quadratic Programs
GUSS is not limited to linear programs, but can be used more generally. Simple (indexed) quadratic models can be solved using GUSS.

The following example illustrates the use of GUSS for quadratic programs. In this example, a support vector machine is used to determine a linear classifier that separates data into two categories. We use the following model:
(3)   minw,g,z(1/2)||w||22 + CeTz
subject toD(Aw-g)+z >= 1
z >= 0
Here, A is a matrix containing the training data (patients by features) and D is a diagonal matrix with values +1 or -1 (each denoting one of the two classes). C is a parameter weighting the importance of maximizing the margin between the classes (2/||w||2) versus minimizing the misclassification error (z). The solution w and g are used to define a separating hyperplane {x | wTx = g} to classify (unseen) data points.

As given, the standard linear support vector machine is not a slice model per se. It becomes a slice model under cross-validation training, where it is solved multiple times on different pieces of data. In this case, only the data A and D vary between solves, appropriately fitting the definition of a slice model.

The data for this example comes from the Wisconsin Diagnosis Breast Cancer Database, and is available here. The data was converted to the GAMS file wdbc.gms, which defines A and D. The actual source code for the following GAMS formulation is available here.

The GUSS formulation for quadratic svm:
$title Ten-fold cross validation example using GUSS
$eolcom !

$setglobal num_folds 10

set p /1*%num_folds%/;  ! folds to perform

! Read in data
$include "wdbc.gms"

set test(p,i);          ! testing set

! Define problem
parameter C /1/;
positive variables z(i);
variables obj, w(k), gamma, slack(i);

equations obj_def, sep_def(i);

obj_def..  obj =e= 1/2*sum(k, sqr(w(k))) + C*sum(i, z(i));
sep_def(i)..
           D(i)*(sum(k, A(i,k)*w(k)) - gamma) + z(i) + slack(i) =g= 1;

model train /all/;

! Generate testing sets (to be deleted in each problem)
loop(p,
$batinclude gentestset2.inc "p,i"
);

set headers report / modelstat, solvestat, objval /;
parameter
    scenrep(p,headers)
    scopt / SkipBaseCase 1, LogOption 2 /;

set dict / p.    scenario.''
           scopt.opt.     scenrep
           z.    upper.   iupper
           slack.upper.   ifree /;

Parameter iupper(p,i), ifree(p,i);
iupper(p,i)$(not test(p,i)) = inf;
ifree(p,i)$test(p,i)        = inf;

option qcp=conopt, optcr=0, optca=0;
solve train using qcp minimizing obj scenario dict;
display scenrep;

Because the problem is quadratic, we must use a quadratic program solver. The variable values for weight and gamma could be saved for later testing using the same method as detailed above for the linear case.

The batinclude file gentestset2.inc is very similar to gentestset.inc from the earlier cross-validation examples. In gentestset2.inc, though, only one set is being dealt with rather than two. The complete source GAMS code for this formulation is available in this zip archive.