wiki:DeveloperTransitionGuide

Developer Transition Guide

So you learned how to write FASTLIB code ages ago, and it was great and everyone was happy. Then, somewhere along the line, everything changed and the world was different. But the world is only different, not destroyed (well, depending on your opinion I guess). This document aims to make the transition easy for old FASTLIB developers who know how things used to work but don't want to spend hours perusing the new code to figure out how what they used to do works now.

Output (#15)

The Old

Old FASTLIB code uses a handful of macros for output. They are wrappers for printf-style functions and prepend a few characters to the front of the message to indicate the severity of the message.

  NOTIFY("some message and a number: %d", 5);
  NONFATAL("almost everything is broken");
  FATAL("everything is broken");

gives (I have omitted colors)

[.] some message and a number: 5
[!] some important message
[X] everything is broken

The FATAL macro terminates the application once it displays the message. It is also important to note that these functions automatically append the newline character to the end of the message.

The New

Now, the IO class provides four streams to be used for C++-style stream output (like cout and cerr essentially). These streams are:

  • IO::Debug -- for debugging output; no messages are shown from this stream and no performance penalty is suffered when compiled without debugging symbols (-DDEBUG=OFF: see UsingCMake)
  • IO::Info -- for informational messages; the user can choose (based on an input option) whether or not this level is suppressed (it is suppressed by default)
  • IO::Warn -- for non-fatal warnings; only to be used when there is a non-fatal problem
  • IO::Fatal -- for fatal errors; the program will terminate with an error code (1)

We can use them like this:

  IO::Info << "At the end of our optimization we took " << i << " iterations and reached a final value of " << final_value << "." << std::endl;
  if (something_terrible)
    IO::Fatal << "Something terrible has happened." << std::endl;

Assuming something_terrible is true, this gives (for i = 10 and final_value = 0.04) (again omitting color)

[INFO ] At the end of our optimization we took 10 iterations and reached a final value of 0.04.
[FATAL] Something terrible has happened.

We use IO::Info where we used to use NOTIFY() (although there are cases where IO::Debug would be the better choice). Note that in these methods we must give the newline (std::endl) too. NONFATAL() messages should map to IO::Warn.

Caveat: The IO::Fatal stream will not cause the program to terminate until a newline character is encountered.

What It Gets Us

The printf-style functions depend on argument types being specified (i.e. %d and so forth). FASTLIB defined the LI macro to help with this, but it causes some problems (see #13). Using stream-based output, we do not have to worry about specifying the types of our arguments and they can be arbitrary. Armadillo components can be passed directly to these streams too.

Although we now have to specify newlines, which is mildly inconvenient (especially if you forget it a lot), this allows us to split output onto multiple lines. Excessively long NOTIFY(...) lines can be split into several operations on IO::Info which looks nicer and is more readable.

Matrix Operations (#24)

The Old

The GenMatrix and GenVector classes (in fastlib/la/) used to provide all of the linear algebra support that was used; as a backend, this used LAPACK, which is very fast. Most of the calls, though, were a bit ugly and the API was complex. Below is an example (Nishant's matrix centering method from mlpack/fastica/lin_alg.h):

/**
 * Whitens a matrix using the eigen decomposition of the covariance
 * matrix. Whitening means the covariance matrix of the result is
 * the identity matrix
 */
void WhitenUsingEig(Matrix X, Matrix* X_whitened, Matrix* whitening_matrix) {
  Matrix cov_X, D, D_inv, E;
  Vector D_vector;
   
  Scale(1 / (double) (X.n_cols() - 1),
        MulTransBInit(&X, &X, &cov_X));

  la::EigenvectorsInit(cov_X, &D_vector, &E);
   
  //E.set(0, 1, -E.get(0, 1));
  //E.set(1, 1, -E.get(1, 1));
   
  index_t d = D_vector.length();
  D.Init(d, d);
  D.SetZero();
  D_inv.Init(d, d);
  D_inv.SetZero();
  for(index_t i = 0; i < d; i++) {
    double sqrt_val = sqrt(D_vector[i]);
    D.set(i, i, sqrt_val);
    D_inv.set(i, i, 1 / sqrt_val);
  }

  la::MulTransBInit(D_inv, E, whitening_matrix);
  la::MulInit(*whitening_matrix, X, X_whitened);
}

The New

Now, we use the Armadillo matrix library, which also uses LAPACK as a backend (as well as other libraries). This library provides us with the arma::mat and arma::vec classes (as well as many others -- see the  documentation). Nishant's same method is now written in a much more compact and readable manner:

/**
 * Whitens a matrix using the eigendecomposition of the covariance
 * matrix. Whitening means the covariance matrix of the result is
 * the identity matrix
 */
void WhitenUsingEig(const arma::mat& X, arma::mat& X_whitened, arma::mat& whitening_matrix) {
  arma::mat diag, eigenvectors;
  arma::vec eigenvalues;
   
  // get eigenvectors of covariance of input matrix
  eig_sym(eigenvalues, eigenvectors, ccov(X));
   
  // generate diagonal matrix using 1 / sqrt(eigenvalues) for each value
  VectorPower(eigenvalues, -0.5);
  diag.zeros(eigenvalues.n_elem, eigenvalues.n_elem);
  diag.diag() = eigenvalues;
   
  // our whitening matrix is diag(1 / sqrt(eigenvectors)) * eigenvalues
  whitening_matrix = diag * trans(eigenvectors);
   
  // now apply the whitening matrix
  X_whitened = whitening_matrix * X;
}

It should be noted that in that particular example, VectorPower() is an auxiliary function used in that file.

What It Gets Us

As you can see, the syntax is a lot more elegant, and the method is a lot easier to read. In addition, we do not have to worry about fixing issues with the matrix classes ourselves; the Armadillo developers are generally very responsive to bugs. We do not have fine-grained control over exactly how the library will look and work, which is the small price we pay for outsourcing all of the work associated with maintaining in-house matrix libraries. We can also assume that the Armadillo project will give better and better performance over time, because they have developers who are dedicated to that project and nothing else (whereas in our case, the GenMatrix class was not the full-time work of anyone -- it was an auxiliary class).

Testing Framework (#84)

The Old

The old FASTLIB code has support for testing in fastlib/base/test.h but very little code actually uses it. Some macros were defined, like TEST_ASSERT(), TEST_DOUBLE_APPROX(), and others, but again, they were used sparsely.

The New

The  Boost Unit Test Framework is the solution to our problems. This defines a simple, unified framework we can use to write test cases and verify that our code works. The documentation you need to get started is summarized below, because the Boost UTF (Unit Test Framework) documentation can be a bit difficult at times.

When you write a method, it should always have unit tests. This way, we can (1) know that the method works and (2) when someone changes code inside of it, we still know that it works.  Here is some random blogger's writeup on why unit testing is very important; it's easy to find more justification for unit testing if you look around on the internet. The moral of the story is, you need to write tests for your code.

To begin with, you'll probably write a file called something_test.cc which is testing Something (whatever that may be). Before you include the Boost UTF header, you must define the name of the test:

#define BOOST_TEST_MODULE Something Test
#include <boost/test/unit_test.hpp>

The name of the test module is allowed to have spaces in it.

Technical note: if you read the Boost UTF documentation, you may see that you have to choose how to link to the Boost libraries. In our case the correct choice is dynamically linking, which is done with a #define BOOST_TEST_DYN_LINK before the Boost UTF include. However, this is done in the CMake configuration, so you don't need to worry about it. If you didn't understand this paragraph, don't worry about it.

Now, you want to define an individual test case; this is easy using the automated registration of test cases. An example is below. Unlike the module name, the test case name cannot have spaces in it. A descriptive test case name is helpful.

/***
 * Documentation on what exactly this test is testing.  If you don't document your test, you might as well not even write it.
 */
BOOST_AUTO_TEST_CASE(name_of_test_case) {
  Something something;
  
  double someValue = something.doSomething();

  // We need to make sure someValue is 1.
  BOOST_REQUIRE_CLOSE(someValue, 1.0, 1e-5);
}

So, you write several of these unit tests (the more the better; tests should be small and simple, as opposed to an unwieldy large test which tests everything all at once). You don't need to write a main() function. The only other thing you need to do is link against the Boost UTF library in the CMake configuration for the test, like so:

# CMakeLists.txt
...

add_executable(something_test
  something_test.cc
)
target_link_libraries(something_test
  mlpack
  boost_unit_test_framework
)

Then, the executable something_test will be built automatically and will run all of the test cases you wrote, providing information on errors if they have occurred. The build server makes sure that your tests function correctly, and lets you know if this is not the case.

Because the Boost UTF provides the main() function itself, it also provides handling of the input parameters. You can run the test executable with --help to get a help message or see  here for documentation on the parameters you can give.

There are several macros you will want to use in your tests to perform the actual testing.  Here is detailed documentation on each of these, but below the most important are summarized.

  • BOOST_REQUIRE(boolean expression) - fails if the expression evaluates to false.
  • BOOST_REQUIRE_CLOSE(float1, float2, tolerance) - compares float1 to float2, failing if the percent difference is more than the given tolerance.
  • BOOST_FAIL(string) - fail with an error message.
  • BOOST_REQUIRE_GE(val1, val2) - fail if val1 is not greater than or equal to val2. Variants exist: LE, LT, and GT. See the documentation for more information.

Note: on older versions of Boost (such as 1.39, which is what ships with RHEL5), these macros will fail if you pass differing types as parameters. For instance, if I called BOOST_REQUIRE_CLOSE(1.0, 1, 1e-5), it would fail because the second 1 is interpreted as an int not a double. #78 has a little more information on this problem, and yes, it is a stupid problem which is fixed in newer versions of Boost. However, we have to support RHEL5, so we need to worry about this (at least until RHEL6 is deployed to the lab).

What It Gets Us

A unified testing framework, coupled with effective test coverage (meaning that everything in our library is tested) allows us to see at a glance that everything in MLPACK works. No longer do we have to wander in the dark hoping that all-nearest-neighbors actually correctly calculates the nearest neighbors. Now, we know this for sure. So, as you can see, the Boost Unit Test Framework gets us quite a lot.

The IO parameter system (#65)

The Old

The FX system, a complex input parameter system which was written in C and not C++, was the old way of getting input parameters. The system offered a lot of functionality but the general usage boiled down to this simple example:

extern fx_module_doc doc; // We'll talk about this later...

int main(int argc, char** argv) {
  fx_module* root = fx_init(argc, argv, doc);

  const char* some_param = fx_param_str(root, "some_param");
  const char* some_required_param = fx_param_str_req(root, "some_required_param");

  int some_number = 0;
  if (fx_param_exists(root, "some_flag"))
    some_number = fx_param_int(root, "some_int");

  // ...

  fx_done(root);
}

This is fairly straightforward; the parameters are accessed using the functions fx_param_<type>() and fx_param_<type>_req() depending on whether or not the parameters are required. The user calls the program like this (or similar to this):

$ ./program --some_param="some string" --some_required_param="some required string" --some_flag --some_number=5

A more complex usage of the FX system exploits its capability to use hierarchical parameters. Because machine learning methods tend to depend on other methods and components, it is useful to not need to specify the same argument at several different levels. For instance, the dual-tree AllkNN calculation has parameters for the all-nearest-neighbors calculation (the selection of k) and also for the tree-building process (the leaf size). So, we can define options for the tree-building process separately from the all-nearest-neighbors options. The user will then call the program like this (or similarly):

$ ./allknn --allknn/k=7 --tree/leaf_size=20

The last important feature of FX was its timers. A timer could be used like this:

fx_timer_start("timer_name");

// Do stuff...

fx_timer_stop("timer_name");

The results of the timer would be reported at the end of execution.

Documentation of the various options is achieved by creating a documentation structure, which may look like this (taken from emst_main.cc):

const fx_entry_doc emst_entries[] = {
  {"input_filename", FX_REQUIRED, FX_STR, NULL,
   "Input dataset (CSV or ARFF)\n"},
  {"output_filename", FX_PARAM, FX_STR, NULL,
   "Filename to output spanning tree into (default output.csv)\n"},
  {"do_naive", FX_PARAM, FX_BOOL, NULL,
   "Whether or not to also perform a naive computation and compare the results\n"
   "   (default N)\n"},
  {"naive_output_filename", FX_PARAM, FX_STR, NULL,
   "Filename to output spanning tree generated with naive algorithm into (use\n"
   "   with --do_naive=Y (default naive_output.csv)\n"},
  FX_ENTRY_DOC_DONE
};

const fx_submodule_doc emst_subdoc[] = {
  {"dtb", &dtb_doc,
  "Parameters for the dual-tree Boruvka algorithm\n"},
  FX_SUBMODULE_DOC_DONE
};

const fx_module_doc emst_doc = {
  emst_entries, emst_subdoc,
  "This is the MLPACK implementation of the dual-tree Boruvka algorithm for\n"
  "finding a Euclidian Minimum Spanning Tree.  The input dataset is specified\n"
  "and the output, which is the minimum spanning tree represented as an edge list,\n"
  "will be placed into the specified output file.\n"
  "\n"
  "The dtb/leaf_size parameter gives the fastest performance with a value of 1;\n"
  "however, it may be changed to conserve memory.\n"
  "\n"
  "The output is given in the format\n"
  "  <edge lesser index> <edge greater index> <distance>\n"
  "for each edge in the minimum spanning tree.\n"
};

The user must specify the given submodules of the program. These objects must be created, and the fx_module_doc object is the one that must be passed to fx_init(). If it is not created, there is no documentation for the program. We should point out that this style of documentation is tedious and time-consuming to get right.

The New

The IO system is meant to replace the FX system, using simpler syntax and a more intuitive API. IO also supports hierarchical parameters, albeit in a slightly different manner. First, here are the three IO functions a developer should ever need to use for input and output parameters:

  • IO::ParseCommandLine(int argc, char** argv) - pass command line parameters to IO; this should be done at the top of main().
  • IO::HasParam(const char* key) - returns a bool indicating whether or not the user passed the parameter 'key'.
  • IO::GetParam<type>(const char* key) - returns a reference to what the user passed for parameter 'key'; because it is a reference, it can be modified (so, an IO::SetParam() method is not necessary).

Here are the two functions relating to timers:

  • IO::StartTimer(const char* name) - start the timer with the given name.
  • IO::StopTimer(const char* name) - stop the timer with the given name.

The fx_done() function has no equivalent as this is not necessary in IO. This is because the IO object is a singleton, which is created before anything else in the program and deleted last. However, there is no way to automatically initialize the object with the program's argc and argv parameters, so the call to IO::ParseCommandLine() is still always necessary.

The last important functions of IO are the functions which define the input parameters and document them. Unlike FX, you *must* define the parameter for it to be able to be used. One advantage is that we don't end up with undocumented programs, like we did before. Unfortunately, the best solution was for these functions to be macros. This goes against the design decision of reducing the number of macros we use, but there were zero appealing alternatives. The macros are listed below.

  • PARAM_FLAG(name, description, parent) (a flag is a boolean parameter; if it is given, it is true)
  • PARAM_INT(name, description, parent, default_value)
  • PARAM_DOUBLE(name, description, parent, default_value)
  • PARAM_STRING(name, description, parent, default_value)
  • PARAM_INT_REQ(name, description, parent)
  • PARAM_DOUBLE_REQ(name, description, parent)
  • PARAM_STRING_REQ(name, description, parent)
  • PARAM_MODULE(name, description)
  • PROGRAM_INFO(name, description)

Each of these macros has a parent parameter; this is the IO implementation of hierarchical parameters. The string given to parent is prepended to the name to produce what the user should type to pass the parameter. For instance, if the program contained

PARAM_FLAG("flag1", "This flag does something.", "flags");
PARAM_FLAG("flag2", "This flag does something else.", "flags");

then the user would pass that flag like this:

$ ./program --flags/flag1 --flags/flag2

The PARAM_MODULE() and PROGRAM_INFO() macros are purely for program documentation; they do not define any parameters. Here is an example which elaborates on the previous example (it is not a very useful example):

PROGRAM_INFO("Happy Flag Program", "This program does something and you can pass two flags to it if you want.");

PARAM_MODULE("flags", "Parameters which control the flags.");

PARAM_FLAG("flag1", "This flag does something.", "flags");
PARAM_FLAG("flag2", "This flag does something else.", "flags");

Then, we can see the program's documentation:

$ ./program --help
Happy Flag Program

  This program does something and you can pass two flags to it if
  you want.

Allowed options:

'flags' module:
  Parameters which control the flags.

  --flags/flags1            This flag does something.
  --flags/flags2            This flag does something else.

Other options:

  --help                    Default help info.
  --info [string]           Get help on a specific module or option.
  --verbose                 Display informational messages and the full list
                            of parameters and timers at the end of execution.
$

As an important note, you do not need to include newline characters in your documentation strings. IO will automatically work out the formatting, to make it consistent across all MLPACK executables.

Now, another important aspect of the IO system is that defining all the various submodules which will be documented explicitly is not necessary. Before, in FX, a developer would define these by correctly linking the fx_module_doc objects together in an fx_submodule_doc object explicitly. Now, it is simply a matter of including the right files.

For instance, let's take the example of Maximum Variance Unfolding (MVU). This method depends on the Augmented Lagrangian optimizer, which in turn uses the L-BFGS optimizer. All three of these objects have individual parameters. The parameters for MVU are defined in mvu.h:

#include <fastlib/fastlib.h>
#include <fastlib/optimization/aug_lagrangian/aug_lagrangian.h>

PARAM_MODULE("mvu", "Parameters for Maximum Variance Unfolding.");

PARAM_INT_REQ("new_dim", "Dimensionality of unfolded dataset.", "mvu");
PARAM_FLAG("pca_preprocess", "If set, the dataset will be preprocessed using PCA.", "mvu");

// The MVU class is written below...

Now, if we look in aug_lagrangian.h, we see:

#include <fastlib/fastlib.h>
#include <fastlib/optimization/lbfgs/lbfgs.h>

PARAM_MODULE("aug_lagrangian", "Parameters for the Augmented Lagrangian optimization algorithm.");

PARAM_DOUBLE("sigma", "Initial value of penalty parameter sigma.", "aug_lagrangian", 0.5);
PARAM_INT("max_iterations", "Maximum number of iterations.", "aug_lagrangian", 1000);

// The AugLagrangian class is written below...

Then, we can look in lbfgs.h and see this:

#include <fastlib/fastlib.h>

PARAM_MODULE("lbfgs", "Options for the L-BFGS optimizer, which uses a "
    "back-tracing line search to determine the step size to take.");

PARAM_DOUBLE("armijo_constant", "Controls the accuracy of the line search "
    "routine for determining the Armijo condition.", "lbfgs", 1e-4);
PARAM_DOUBLE("min_step", "The minimum step of the line search.", "lbfgs",
    1e-20);
PARAM_DOUBLE("max_step", "The maximum step of the line search.", "lbfgs",
    1e20);
PARAM_INT("max_line_search_trials", "The maximum number of trials for the line "
    "search.", "lbfgs", 50);
PARAM_DOUBLE("wolfe", "Parameter for detecting the Wolfe condition.", "lbfgs",
    0.9);
PARAM_DOUBLE("min_gradient_norm", "Minimum gradient norm required to continue "
    "the optimization.", "lbfgs", 1e-10);

Now, because mvu.h includes aug_lagrangian.h (which in turn includes lbfgs.h), all of these parameters are correctly added and documented. If we were to run the executable including those header files, asking for help, we would get:

$ ./mvu --help
Maximum Variance Unfolding

  (Explanation of MVU...)

Allowed options:

'aug_lagrangian' module:
  Parameters for the Augmented Langrangian optimization algorithm.

  --aug_lagrangian/max_iter [int]
                              Maximum number of iterations.
  --aug_lagrangian/sigma [double]
                              Initial value of the penalty parameter sigma.

'lbfgs' module: 
  Options for the L-BFGS optimizer, which uses a back-tracing line search to
  determine the step size to take.
  

  --lbfgs/armijo_constant [double]
                              Controls the accuracy of the line search routine
                              for determining the Armijo condition.
                              
  --lbfgs/max_line_search_trials [int]
                              The maximum number of trials for the line search.
  --lbfgs/max_step [double]   The maximum step of the line search.
  --lbfgs/min_gradient_norm [double]
                              Minimum gradient norm required to continue the
                              optimization.
                              
  --lbfgs/min_step [double]   The minimum step of the line search.
  --lbfgs/wolfe [double]      Parameter for detecting the Wolfe condition.

'mvu' module:
  Parameters for Maximum Variance Unfolding.

  --mvu/new_dim [int]         Dimensionality of unfolded dataset.
  --mvu/pca_preprocess        If set, the dataset will be preprocessed using
                              PCA.

Other options:

  --help                    Default help info.
  --info [string]           Get help on a specific module or option.
  --verbose                 Display informational messages and the full list
                            of parameters and timers at the end of execution.
$

Therefore, we have included all the documentation and parameters for the pieces of our machine learning method merely by including the header files. As a result, when you write an algorithm, the parameters that will be used to control it from the command line should be put into the header file of the algorithm's class (i.e. for MVU we put the parameter definitions into mvu.h, not mvu_main.cc).