rfdavid

Exploring Kùzu Graph Database Management System code

2023-02-22T00:00:00+00:00

Introduction

Kùzu is a graph database management system (GDBMS) born from several years of research at the University of Waterloo. It is highly optimized for complex, join-heavy analytical workloads on very large databases, playing a role similar to the one DuckDB plays for SQL. It is especially useful when you need to model data from different sources as a graph and store it in one place for fast extraction in analytics. Kùzu integrates with PyTorch Geometric, making it easy to extract graph data and feed it into your PyG models to perform a GNN task. This article contains my annotations from when I started exploring how the Kùzu database works. I took a ‘depth-limited search’ approach to the code, first going to the CLI and running a simple query, and used LLDB to debug and learn more about the overall design of the database.

Starting from the embedded shell

The goal here is to trace what happens internally, from initialization to the execution of a match query, starting at the CLI tool.

Kùzu uses the args library (#include "args.hxx") to parse command-line arguments. For instance, the database path (the -i parameter) and the buffer pool size can be retrieved with:

auto databasePath = args::get(inputDirFlag);
uint64_t bpSizeInMB = args::get(bpSizeInMBFlag);

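The buffer pool size defaults to -1u: uint64_t bpSizeInBytes = -1u;. This value appears to act as a sentinel meaning that no size was given on the command line, in which case SystemConfig derives one from the system memory.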
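A side note on that default: -1u is an unsigned int literal, so it wraps to the largest unsigned int before being widened to uint64_t. The sketch below only illustrates this language rule; the variable names are mine and are not taken from the Kùzu source.

```cpp
#include <cstdint>
#include <limits>

// -1u is an unsigned int literal: it wraps to UINT_MAX *before* the
// conversion to uint64_t, so the upper bits of the result stay zero.
constexpr std::uint64_t fromUnsignedLiteral = -1u;

// Converting the signed value -1 directly sets all 64 bits instead.
constexpr std::uint64_t allBitsSet = static_cast<std::uint64_t>(-1);

static_assert(fromUnsignedLiteral == std::numeric_limits<unsigned int>::max(),
              "-1u equals the largest unsigned int");
static_assert(allBitsSet == std::numeric_limits<std::uint64_t>::max(),
              "static_cast<uint64_t>(-1) equals the largest uint64_t");
```

Assuming a 32-bit unsigned int, as on common desktop platforms, the first constant is 4294967295 rather than the largest uint64_t. This does not matter as long as every comparison against the sentinel also uses -1u.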

SystemConfig

shell_runner.cpp: SystemConfig systemConfig(bpSizeInBytes);

SystemConfig will initialize 4 variables:

systemMemSize: the total memory in the system. This is computed by multiplying the number of pages of physical memory by the size of a page in bytes. Both values are retrieved using sysconf from unistd.h.

database.cpp:
   24           auto systemMemSize =
-> 25               (std::uint64_t)sysconf(_SC_PHYS_PAGES) * (std::uint64_t)sysconf(_SC_PAGESIZE);

(lldb) p systemMemSize
(unsigned long long) $9 = 34359738368

_SC_PHYS_PAGES : the number of pages of physical memory
_SC_PAGESIZE : size of a page in bytes
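The same computation can be tried outside the debugger. This is a minimal standalone sketch, assuming a system where sysconf supports _SC_PHYS_PAGES (Linux and macOS do); the function name physicalMemoryBytes is hypothetical and the error handling is my own addition.

```cpp
#include <cstdint>
#include <unistd.h>

// Total physical memory in bytes: (number of physical pages) * (bytes per page).
// sysconf returns -1 on failure; this sketch reports 0 in that case.
std::uint64_t physicalMemoryBytes() {
    const long pages = sysconf(_SC_PHYS_PAGES);
    const long pageSize = sysconf(_SC_PAGESIZE);
    if (pages < 0 || pageSize < 0) {
        return 0;
    }
    return static_cast<std::uint64_t>(pages) * static_cast<std::uint64_t>(pageSize);
}
```

On the machine used for the lldb session above, the value 34359738368 corresponds to 32 GiB.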

bufferPoolSize: the smaller of the system memory and UINTPTR_MAX, multiplied by the default buffer pool ratio. UINTPTR_MAX is the largest value a uintptr_t can hold. StorageConfig is located at include/common/configs.h and is a struct holding many of the default values used by the application.

-> 26           bufferPoolSize = (uint64_t)(StorageConfig::DEFAULT_BUFFER_POOL_RATIO *
   27                                       (double_t)std::min(systemMemSize, (std::uint64_t)UINTPTR_MAX));

defaultPageBufferPoolSize and largePageBufferPoolSize: the bufferPoolSize multiplied by the ratio defined for default pages and large pages.

   29       defaultPageBufferPoolSize =
-> 30           (uint64_t)((double_t)bufferPoolSize * StorageConfig::DEFAULT_PAGES_BUFFER_RATIO);
   31       largePageBufferPoolSize =
   32           (uint64_t)((double_t)bufferPoolSize * StorageConfig::LARGE_PAGES_BUFFER_RATIO);

include/common/configs.h:
struct StorageConfig {
    // The default ratio of system memory allocated to buffer pools (including default and large).
    static constexpr double DEFAULT_BUFFER_POOL_RATIO = 0.8;
    // The default ratio of buffer allocated to default and large pages.
    static constexpr double DEFAULT_PAGES_BUFFER_RATIO = 0.75;
    static constexpr double LARGE_PAGES_BUFFER_RATIO = 1.0 - DEFAULT_PAGES_BUFFER_RATIO;
    ... (omitted)
};

(lldb) p largePageBufferPoolSize/(1024*1024*1024)
(unsigned long long) $28 = 6
(lldb) p defaultPageBufferPoolSize/(1024*1024*1024)
(unsigned long long) $29 = 19
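To check these numbers by hand, the arithmetic can be reproduced in a few lines. This is a sketch that copies the ratios from the StorageConfig excerpt; the PoolSizes struct and the computePoolSizes function are hypothetical helpers, not part of Kùzu.

```cpp
#include <algorithm>
#include <cstdint>

// Ratios copied from the StorageConfig excerpt.
constexpr double kBufferPoolRatio = 0.8;
constexpr double kDefaultPagesRatio = 0.75;
constexpr double kLargePagesRatio = 1.0 - kDefaultPagesRatio;

// Hypothetical helper type used only for this illustration.
struct PoolSizes {
    std::uint64_t total;
    std::uint64_t defaultPages;
    std::uint64_t largePages;
};

PoolSizes computePoolSizes(std::uint64_t systemMemSize) {
    // Cap the memory at the largest addressable value, as in database.cpp.
    const std::uint64_t capped =
        std::min(systemMemSize, static_cast<std::uint64_t>(UINTPTR_MAX));
    PoolSizes sizes{};
    sizes.total = static_cast<std::uint64_t>(kBufferPoolRatio * static_cast<double>(capped));
    sizes.defaultPages =
        static_cast<std::uint64_t>(static_cast<double>(sizes.total) * kDefaultPagesRatio);
    sizes.largePages =
        static_cast<std::uint64_t>(static_cast<double>(sizes.total) * kLargePagesRatio);
    return sizes;
}
```

Calling computePoolSizes(34359738368) on a 64-bit machine gives 20615843020 bytes for default pages and 6871947673 bytes for large pages, which truncate to the 19 and 6 GiB printed by lldb.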

maxNumThreads: the number of concurrent threads supported by the available hardware. This number is only a hint and might not be accurate.

(lldb) p maxNumThreads
(uint64_t) $30 = 12
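The description of maxNumThreads matches what the C++ standard library says about std::thread::hardware_concurrency, so I assume that is the call behind it. A small sketch under that assumption; the function name and the fallback to 1 are my own additions:

```cpp
#include <cstdint>
#include <thread>

// hardware_concurrency() is only a hint and may return 0 when the value
// cannot be determined; fall back to a single thread in that case.
std::uint64_t detectMaxNumThreads() {
    const unsigned int hint = std::thread::hardware_concurrency();
    return hint == 0 ? 1 : hint;
}
```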

Embedded Shell

Initialize an instance of EmbeddedShell (tools/shell/embedded_shell.cpp):

tools/shell/shell_runner.cpp:
-> 33           auto shell = EmbeddedShell(databasePath, systemConfig);

tools/shell/embedded_shell.cpp:

   201  EmbeddedShell::EmbeddedShell(const std::string& databasePath, const SystemConfig& systemConfig) {
-> 202      linenoiseHistoryLoad(HISTORY_PATH);
   203      linenoiseSetCompletionCallback(completion);
   204      linenoiseSetHighlightCallback(highlight);
   205      database = std::make_unique<Database>(databasePath, systemConfig);
   206      conn = std::make_unique<Connection>(database.get());
   207      updateTableNames();
   208  }

Initialize the embedded shell using the databasePath from the parameter and also the systemConfig previously defined:

(lldb) p systemConfig
(const kuzu::main::SystemConfig) $31 = {
  defaultPageBufferPoolSize = 20615843020
  largePageBufferPoolSize = 6871947673
  maxNumThreads = 12
}

linenoise is a lightweight line-editing library that provides useful functionality such as single-line and multi-line editing modes, history handling, completion and hints as you type, among others. It is used in Redis, MongoDB and Android. The library is embedded in the codebase (tools/shell/linenoise.cpp). I won’t get into the details of the linenoise configuration.

-> 205      database = std::make_unique<Database>(databasePath, systemConfig);
-> 206      conn = std::make_unique<Connection>(database.get());

database and conn are both defined in embedded_shell.h:

private:
    std::unique_ptr<Database> database;
    std::unique_ptr<Connection> conn;
};

Lines 205 and 206 create the Database instance and a Connection to it, respectively. Before getting into Connection in the next section, I’ll take a look at updateTableNames(), since this is where the catalog is used to read the database schema.

updateTableNames()

There are two types of tables: node tables and relationship tables. updateTableNames stores the table names for both by fetching them from database->catalog. In my database, I have “person” and “animal” node tables and “hasOwner” and “knows” relationship tables:

tools/shell/embedded_shell.cpp:

   67   void EmbeddedShell::updateTableNames() {
   68       nodeTableNames.clear();
   69       relTableNames.clear();
-> 70       for (auto& tableSchema : database->catalog->getReadOnlyVersion()->getNodeTableSchemas()) {
   71           nodeTableNames.push_back(tableSchema.second->tableName);
   72       }
   73       for (auto& tableSchema : database->catalog->getReadOnlyVersion()->getRelTableSchemas()) {
   74           relTableNames.push_back(tableSchema.second->tableName);
   75       }
   76   }

lldb output:

(lldb) p nodeTableNames
(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) $41 = size=2 {
  [0] = "person"
  [1] = "animal"
}
(lldb) p relTableNames
(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) $42 = size=2 {
  [0] = "hasOwner"
  [1] = "knows"
}
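The loop pattern in updateTableNames can be reproduced in isolation. In this sketch TableSchema and SchemaMap are hypothetical stand-ins; the only thing taken from the listing is that the catalog returns a map whose values point to schemas with a tableName field.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the catalog's schema type.
struct TableSchema {
    std::string tableName;
};

// Assumed shape of the catalog map: table id -> owned schema.
using SchemaMap = std::unordered_map<std::uint64_t, std::unique_ptr<TableSchema>>;

// Collects the table names, mirroring the two loops in updateTableNames().
std::vector<std::string> collectTableNames(const SchemaMap& schemas) {
    std::vector<std::string> names;
    names.reserve(schemas.size());
    for (const auto& entry : schemas) {
        names.push_back(entry.second->tableName);
    }
    return names;
}
```

With an unordered map the order of the returned names is unspecified, so callers should not rely on it.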

Connection (src/main/connection.cpp)

Connection is used to interact with a Database instance, and each Connection is thread-safe. Multiple connections can connect to the same Database instance in a multi-threaded environment. The description of the API below was extracted from src/include/main/connection.h:

Creates a connection to the database.

KUZU_API explicit Connection(Database* database);

Destructor

KUZU_API ~Connection();

Manually starts a new read-only transaction in the current connection.

KUZU_API void beginReadOnlyTransaction();

Manually starts a new write transaction in the current connection.

KUZU_API void beginWriteTransaction();

Manually commits the current transaction.

KUZU_API void commit();

Manually rollbacks the current transaction.

KUZU_API void rollback();

Sets the maximum number of threads to use for execution in the current connection.

KUZU_API void setMaxNumThreadForExec(uint64_t numThreads);

Returns the maximum number of threads to use for execution in the current connection.

KUZU_API uint64_t getMaxNumThreadForExec();

Executes the given query and returns the result.

KUZU_API std::unique_ptr<QueryResult> query(const std::string& query);

Prepares the given query and returns the prepared statement.

KUZU_API std::unique_ptr<PreparedStatement> prepare(const std::string& query);

Executes the given prepared statement with args and returns the result.

KUZU_API template<typename... Args>
inline std::unique_ptr<QueryResult> execute(
    PreparedStatement* preparedStatement, std::pair<std::string, Args>... args) {
    std::unordered_map<std::string, std::shared_ptr<common::Value>> inputParameters;
    return executeWithParams(preparedStatement, inputParameters, args...);
}

Executes the given prepared statement with inputParams and returns the result.

KUZU_API std::unique_ptr<QueryResult> executeWithParams(PreparedStatement* preparedStatement,
    std::unordered_map<std::string, std::shared_ptr<common::Value>>& inputParams);

Return all node table names in string format.

KUZU_API std::string getNodeTableNames();

Return all rel table names in string format.

KUZU_API std::string getRelTableNames();

Return the node property names.

KUZU_API std::string getNodePropertyNames(const std::string& tableName);

Return the relation property names.

KUZU_API std::string getRelPropertyNames(const std::string& relTableName);

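If you are wondering what is behind KUZU_API, it is a macro that marks a symbol as part of the exported public API. It is also applied to public types such as DataTypeID, which is defined in src/include/common/types/types.h: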

KUZU_API enum DataTypeID : uint8_t {
    ANY = 0,
    NODE = 10,
    REL = 11,

    // physical types

    // fixed size types
    BOOL = 22,
    INT64 = 23,
    DOUBLE = 24,
    DATE = 25,
    TIMESTAMP = 26,
    INTERVAL = 27,

    INTERNAL_ID = 40,

    // variable size types
    STRING = 50,
    LIST = 52,
};

Starting from C++ API

I will now explore COPY command from the C++ API by using the existing example from examples/cpp/main.cpp. To compile, you just have to add add_subdirectory(examples/cpp) inside CMakeLists.txt and run make test or make debug. The example will be compiled and available at build/debug/examples/cpp or build/release/examples/cpp depending on the make parameter used to compile.

main.cpp

#include <iostream>

#include "main/kuzu.h"
using namespace kuzu::main;

int main() {
    auto database = std::make_unique<Database>("/tmp/db");
    auto connection = std::make_unique<Connection>(database.get());

    connection->query("CREATE NODE TABLE tableOfTypes (id INT64, int64Column INT64, doubleColumn DOUBLE, booleanColumn BOOLEAN, dateColumn DATE, timestampColumn TIMESTAMP, stringColumn STRING, PRIMARY KEY (id));");
    connection->query("COPY tableOfTypes FROM \"/Users/rfdavid/Devel/waterloo/kuzu/dataset/copy-test/node/csv/types_50k.csv\" (HEADER=true);");
}

This example creates a node table named tableOfTypes (from the copy-test schema) and uses the COPY command to import 50k rows from the types_50k.csv file.

I will start debugging by adding a breakpoint before the COPY command:

(lldb) b main.cpp:11
Breakpoint 1: where = example-cpp`main + 136 at main.cpp:11:5, address = 0x0000000100003c80
(lldb) r
Process 59055 launched: '/Users/rfdavid/Devel/waterloo/kuzu/build/debug/examples/cpp/example-cpp' (arm64)
Process 59055 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100003c80 example-cpp`main at main.cpp:11:5
   1   	#include <iostream>
   2
   3   	#include "main/kuzu.h"
   4   	using namespace kuzu::main;
   5
   6   	int main() {
   7   	    auto database = std::make_unique<Database>("/tmp/db");
   8   	    auto connection = std::make_unique<Connection>(database.get());
   9
   10  	    connection->query("CREATE NODE TABLE tableOfTypes (id INT64, int64Column INT64, doubleColumn DOUBLE, booleanColumn BOOLEAN, dateColumn DATE, timestampColumn TIMESTAMP, stringColumn STRING, PRIMARY KEY (id));");
-> 11  	    connection->query("COPY tableOfTypes FROM \"/Users/rfdavid/Devel/waterloo/kuzu/dataset/copy-test/node/csv/types_50k.csv\" (HEADER=true);");
   12  	}
Target 0: (example-cpp) stopped.

Inside Connection::query, a mutex lock is acquired, a preparedStatement is created, and it is then executed through executeAndAutoCommitIfNecessaryNoLock.

   76  	std::unique_ptr<QueryResult> Connection::query(const std::string& query) {
   77  	    lock_t lck{mtx};
-> 78  	    auto preparedStatement = prepareNoLock(query);
   79  	    return executeAndAutoCommitIfNecessaryNoLock(preparedStatement.get());
   80  	}

A prepared statement is a parameterized query that can be executed repeatedly with different parameters without being parsed and planned again each time. prepareNoLock goes through the following steps: parsing, binding, planning and optimizing. It then returns a PreparedStatement object to Connection::query.
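The NoLock suffix suggests a common locking convention: the public method takes the mutex once, and the helpers it calls assume the lock is already held. Here is a minimal sketch of that convention; MiniConnection and its members are hypothetical and far simpler than Kùzu's Connection.

```cpp
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical class illustrating the "lock in the public method,
// NoLock helpers underneath" convention.
class MiniConnection {
public:
    // Public entry point: acquires the mutex, then delegates.
    std::size_t query(const std::string& text) {
        std::lock_guard<std::mutex> lck{mtx};
        const std::string prepared = prepareNoLock(text);
        return executeNoLock(prepared);
    }

private:
    // NoLock helpers must only be called while mtx is held.
    std::string prepareNoLock(const std::string& text) const {
        return "prepared: " + text;
    }

    // Returns the number of statements executed so far.
    std::size_t executeNoLock(const std::string& prepared) {
        lastStatement = prepared;
        return ++executedCount;
    }

    std::mutex mtx;
    std::string lastStatement;
    std::size_t executedCount = 0;
};
```

Keeping the lock acquisition in one place avoids locking the same non-recursive mutex twice when one helper calls another.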

Influence Functions in Machine Learning

2022-08-31T13:00:00+00:00

Introduction

With the increasing complexity of machine learning models, the generated predictions are not easily interpretable by humans, and the models are usually treated as black boxes. To address this issue, the rising field of explainability tries to understand why those models make certain predictions. In recent years, the work by [1] has attracted a lot of attention in many fields, using the idea of influence functions [2] to identify the training points most responsible for a given prediction.

Robust Statistics

Statistical methods rely explicitly or implicitly on assumptions about the data and the problem stated. The assumption usually concerns the probability distribution of the dataset. The most widely used framework assumes that the observed data have a normal (Gaussian) distribution, and this classical statistical method has been used for regression, analysis of variance and multivariate analysis. However, real-life data is noisy and contains atypical observations, called outliers. Those observations deviate from the general pattern of the data, and classical estimates such as the sample mean and sample variance can be highly adversely influenced by them. This can result in a bad fit of the data. Robust statistics provides estimators that still give a good fit for data containing outliers [3].

Influence Functions

Influence functions (IF) were first introduced in “The Influence Curve and Its Role in Robust Estimation” [2], and measure the impact of an infinitesimal perturbation on an estimator. The very interesting work by [1] brought this methodology into machine learning.

Influence Functions in Machine Learning

Consider an image classification task where the goal is to predict the label for a given image. We want to measure the impact of a particular training image on a testing image. A naive approach is to remove the image and retrain the model. However, this approach is prohibitively expensive. To overcome this problem, influence functions upweight that particular point by an infinitesimal amount and measure the impact on the loss function without having to retrain the model.

Figure 1: The fish image is upweighted by an infinitesimal amount so the model tries harder to fit that particular sample. Image by the author.

Change in Parameters

The empirical risk minimizer to solve an optimization problem can be defined as the following:

\[\begin{equation} \hat\theta = arg \; \underset{\theta}{min} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(z_i, \theta) \end{equation}\]

Where \(z_i\) is each training point from a training sample. First, we need to understand how the parameters change after upweighting a particular training point \(z\) by an infinitesimal amount \(\epsilon\). The change is \(\hat\theta_{\epsilon,z} - \hat\theta\), where \(\hat\theta\) is the original set of parameters for the full training data and \(\hat\theta_{\epsilon,z}\) is the new set of parameters after upweighting:

\[\begin{equation} \hat\theta_{\epsilon,z} = arg \; \underset{\theta}{min} \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(z_i,\theta) + \epsilon \mathcal{L}(z,\theta) \end{equation}\]

As we want to measure the rate of change of the parameters after perturbing the point, the derivation made by [4] yields the following:

\[\begin{equation} I(z) = \frac{d\hat\theta_{\epsilon,z}}{d\epsilon} \bigg|_{\epsilon=0} = -H_{\hat\theta}^{-1}\nabla_{\theta} \mathcal{L}(z,\hat\theta) \end{equation}\]

Where \(H_{\hat\theta}\) is the Hessian matrix and assumed to be positive definite (symmetric with all positive eigenvalues), which can be calculated by \(\frac{1}{n}\sum_{i=1}^n \nabla_{\theta}^2 \mathcal{L}(z_i,\hat\theta)\).

Equation \(3\) gives the influence of a single training point \(z\) on the parameters \(\theta\). Removing \(z\) is the same as upweighting it by \(\epsilon = -\frac{1}{n}\), so \(-\frac{1}{n} I(z)\) approximates the parameter change obtained by removing \(z\) and retraining the model.
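A tiny worked case makes the removal approximation concrete. The sketch below assumes a one-parameter model with loss \(\mathcal{L}(z,\theta) = \frac{1}{2}(\theta - z)^2\), chosen by me for illustration: its minimizer is the sample mean and its Hessian is 1, so equation 3 can be evaluated by hand. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Model: one parameter theta with loss L(z, theta) = 0.5 * (theta - z)^2.
// The empirical risk minimizer is the sample mean and the Hessian is H = 1.
double mean(const std::vector<double>& data) {
    assert(!data.empty());
    return std::accumulate(data.begin(), data.end(), 0.0) /
           static_cast<double>(data.size());
}

// I(z) = -H^{-1} * dL/dtheta = -(thetaHat - z), because H = 1.
double influenceOnParameter(double thetaHat, double z) {
    return -(thetaHat - z);
}

// Predicted parameter change when data[index] is removed: -(1/n) * I(z).
double predictedChangeOnRemoval(const std::vector<double>& data, std::size_t index) {
    const double thetaHat = mean(data);
    return -influenceOnParameter(thetaHat, data[index]) /
           static_cast<double>(data.size());
}

// Exact change obtained by actually dropping the point and refitting.
double actualChangeOnRemoval(std::vector<double> data, std::size_t index) {
    const double before = mean(data);
    data.erase(data.begin() + static_cast<std::ptrdiff_t>(index));
    return mean(data) - before;
}
```

For the data {1, 2, 3, 4, 10}, removing the point 10 is predicted to move the mean by -1.2, while refitting moves it by -1.5. In this model the two differ by a factor of n/(n-1), so the gap shrinks as the training set grows.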

Change in the Loss Function

As we want to measure the change in the loss function for a particular testing point, applying the chain rule gives the following equation:

\[\begin{equation} I(z, z_{test}) = \frac{d \mathcal{L}(z_{test},\hat\theta_{\epsilon, z})}{d\epsilon} \bigg|_{\epsilon=0} = -\nabla_\theta \mathcal{L}(z_{test},\hat\theta)^T H_{\hat\theta}^{-1} \nabla_\theta \mathcal{L}(z,\hat\theta) \end{equation}\]

\(\frac{1}{n} I(z, z_{test})\) approximately measures the impact of \(z\) on \(z_{test}\). This is based on the assumption that the underlying loss function is strictly convex in the parameters \(\theta\) (a convex function is a continuous function whose value at the midpoint of every interval in its domain does not exceed the arithmetic mean of its values at the ends of the interval; usually, a loss function is considered to be convex). Some loss functions, such as the hinge loss, are not differentiable. For this case, one of the contributions of Koh’s work is to approximate the loss with a differentiable function around the margin.
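The chain-rule step can be written out explicitly. This is only a sketch of the intermediate step, using the symbols already defined and substituting equation 3 for the derivative of the parameters:

```latex
\[
\frac{d \mathcal{L}(z_{test},\hat\theta_{\epsilon,z})}{d\epsilon} \bigg|_{\epsilon=0}
= \nabla_\theta \mathcal{L}(z_{test},\hat\theta)^T \, \frac{d\hat\theta_{\epsilon,z}}{d\epsilon} \bigg|_{\epsilon=0}
= \nabla_\theta \mathcal{L}(z_{test},\hat\theta)^T \, I(z)
= -\nabla_\theta \mathcal{L}(z_{test},\hat\theta)^T H_{\hat\theta}^{-1} \nabla_\theta \mathcal{L}(z,\hat\theta)
\]
```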

Influence Functions on Groups

As previously seen, influence functions measure the impact of a training point on a single testing point. They are based on a first-order Taylor approximation (the approximation becomes “better” as more terms of the Taylor series are included), which is fairly accurate for small changes. To study the effect of a large group of training points, [5] analyze the particular cases where influence functions can still be used. The group influence can be written as the sum of the influences of the individual points in the group:

\[\sum_{i=1}^n I(z_i, z_{test}) = -\nabla_\theta \mathcal{L}(z_{test},\hat\theta)^T H_{\hat\theta}^{-1} \sum_{i=1}^n \nabla_\theta \mathcal{L}(z_i,\hat\theta)\]

Given a group \(\mathcal{U}\) and its first-order group influence \(I(\mathcal{U})^{(1)}\), [6] proposes a second-order group influence function to capture informative cross-dependencies among samples:

\[I(\mathcal{U})^{(2)} = I(\mathcal{U})^{(1)} + I(\mathcal{U})^{'}\]

Here, the first-order group influence function \(I(\mathcal{U})^{(1)}\) is defined as:

\[I(\mathcal{U})^{(1)} = \frac{\partial \theta_{\mathcal{U}}^{\epsilon}}{\partial \epsilon} \bigg|_{\epsilon=0}\]

And the second-order term \(I(\mathcal{U})^{'}\) as:

\[I(\mathcal{U})^{'} = \frac{\partial^2 \theta_{\mathcal{U}}^{\epsilon}}{\partial \epsilon^2} \bigg|_{\epsilon=0}\]

This technique was shown empirically to improve the selection of the most influential group for a test sample across different group sizes and types. The idea is to capture more information when the changes to the underlying model are relatively large.

The Calculation Bottleneck

Computing the inverse Hessian is quite expensive and infeasible for a network with lots of parameters. In NumPy, a matrix inverse can be calculated using numpy.linalg.inv. As a side note, NumPy is mostly written in C and the high-level functions are Python bindings. Nevertheless, it is still an expensive function. In PyTorch, you can compute the Hessian using torch.autograd.functional.hessian and then invert it with torch.linalg.inv. I’m going to expand a little bit here because this is a bit tricky. The module torch.nn contains different classes that provide useful methods for models that inherit from nn.Module.

Functional modules take NN modules and turn them into purely functional, stateless versions so you can explicitly pass parameters to a function.

torch.autograd.functional requires the parameters to be passed to a function (see the long discussion here).

Conjugate Gradients

Conjugate gradient [7] is an iterative method for solving large systems of linear equations, and it is effective for systems of the form \(Ax = b\). In [8], the Hessian is approximated within a second-order optimization technique. This method does not invert the Hessian directly but calculates the inverse-Hessian-vector product:

\[H^{-1} v = arg \; \underset{t}{min} \Big(\frac{1}{2} t^T H t - v^T t\Big)\]
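Computing \(H^{-1} v\) amounts to solving \(Ht = v\), which is exactly the kind of system conjugate gradient handles. Below is a minimal sketch on a 2x2 symmetric positive definite matrix; the types and function names are illustrative, and a real implementation would replace matVec with a Hessian-vector product computed by automatic differentiation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<Vec2, 2>;

double dot(const Vec2& a, const Vec2& b) {
    return a[0] * b[0] + a[1] * b[1];
}

Vec2 matVec(const Mat2& m, const Vec2& x) {
    return Vec2{{dot(m[0], x), dot(m[1], x)}};
}

// Solves H x = v by conjugate gradients for a symmetric positive definite H.
// Only matrix-vector products are used; H is never inverted.
Vec2 conjugateGradient(const Mat2& H, const Vec2& v,
                       std::size_t maxIters = 10, double tolSquared = 1e-24) {
    Vec2 x{{0.0, 0.0}};
    Vec2 r = v;  // residual v - H x, with x = 0
    Vec2 p = r;  // current search direction
    double rr = dot(r, r);
    for (std::size_t k = 0; k < maxIters && rr > tolSquared; ++k) {
        const Vec2 Hp = matVec(H, p);
        const double pHp = dot(p, Hp);
        assert(pHp > 0.0);  // holds when H is positive definite
        const double alpha = rr / pHp;
        for (std::size_t i = 0; i < 2; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Hp[i];
        }
        const double rrNew = dot(r, r);
        const double beta = rrNew / rr;
        for (std::size_t i = 0; i < 2; ++i) {
            p[i] = r[i] + beta * p[i];
        }
        rr = rrNew;
    }
    return x;
}
```

For H = [[4, 1], [1, 3]] and v = [1, 2], the solution is [1/11, 7/11], and in exact arithmetic the method reaches it in two iterations, one per dimension.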

Linear Time Stochastic Second-Order Algorithm (LiSSA)

The main idea of LiSSA [9] is to use Taylor expansion (Neumann series) to construct a natural estimator of the inverse Hessian:

\[H^{-1} = \sum^{\infty}_{i=0} (I - H)^i\]

Rewriting this equation recursively, as \(\lim_{j \to \infty} H_{j}^{-1} = H^{-1}\), we have the following:

\[H_{j}^{-1} = \sum^{j}_{i=0} (I - H)^i = I + (I - H) H^{-1}_{j-1}\]
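Applied to a vector \(v\), the recursion becomes \(h_j = v + (I - H) h_{j-1}\) with \(h_0 = v\). The sketch below is a deterministic version on a fixed 2x2 matrix, assuming the eigenvalues of H lie in (0, 2) so the series converges. LiSSA itself replaces H at each step with a Hessian estimated from sampled training points, which is not shown here. Names are illustrative.

```cpp
#include <array>
#include <cstddef>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<Vec2, 2>;

Vec2 matVec(const Mat2& m, const Vec2& x) {
    return Vec2{{m[0][0] * x[0] + m[0][1] * x[1],
                 m[1][0] * x[0] + m[1][1] * x[1]}};
}

// Approximates H^{-1} v with the truncated Neumann series
//   h_j = v + (I - H) h_{j-1},  h_0 = v.
// Assumption: the eigenvalues of H lie in (0, 2), otherwise the series diverges.
Vec2 neumannInverseHvp(const Mat2& H, const Vec2& v, std::size_t depth) {
    Vec2 h = v;
    for (std::size_t j = 0; j < depth; ++j) {
        const Vec2 Hh = matVec(H, h);
        for (std::size_t i = 0; i < 2; ++i) {
            h[i] = v[i] + h[i] - Hh[i];  // v + (I - H) h
        }
    }
    return h;
}
```

A quick check is to multiply the result by H and compare it with v. In practice the Hessian is rescaled so that the eigenvalue condition holds.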

FastIF

To improve scalability and reduce the computational cost, FastIF [10] presents a set of modifications to improve the runtime. The work uses k-nearest neighbours (k-NN) to narrow the search space down, which can be inexpensive in this context since k-NN is a lazy learner (it doesn’t learn a discriminative function from the training data, but only stores the dataset).

The Problem with Influence Functions

Influence functions are an approximation and do not always produce correct values. In some particular settings, influence functions can have a significant loss in information quality. They are known to work with convex loss functions, but for non-convex setups, the estimates may not work as expected. The work ‘Influence Functions in Deep Learning are Fragile’ [11] examines the conditions where influence estimation can be applied to deep networks through vast experimentation. In short, there are a few obstacles:

The estimation in deeper architectures is erroneous, possibly due to poor inverse hessian estimation. Weight-decay regularization can help.
Wide networks perform poorly. When increasing the width of a network, the correlation between the true difference in the loss and the influence function decreases substantially.
Scaling influence functions is challenging. ImageNet contains 1.2 million images in the training set, which makes it difficult to evaluate whether influence functions are effective, since it is computationally prohibitive to re-train the model multiple times, leaving each training point out of the training.

Libraries

There are several implementations available in Python with PyTorch and TensorFlow. A few others are built on R and Matlab.

Influence Functions
The official version of [1] built on TensorFlow.

Influence Functions for PyTorch
PyTorch implementation. It uses stochastic estimation to calculate the influence.

Torch Influence
A recent implementation (Jul/2022) of influence functions on PyTorch, providing three different ways to calculate the inverse hessian: direct computation and inversion with torch.autograd, truncated conjugate gradients and LiSSA.

Fast Influence Functions
A modified influence function computation using k-Nearest Neighbors (kNN), implemented in PyTorch.

Other implementations

Influence Function with LiSSA
A simple implementation with LiSSA on TensorFlow.

Influence Pytorch
One-file code with the implementation for a random classification problem.

IF notebook
Python notebook with IF applied to other algorithms (Trees, Ridge Regression). Ridge regression is a method to estimate the coefficients of multiple regression models where the independent variables are highly correlated.

Influence Functions Pytorch
Another implementation of influence functions.

Applications

Explainability: This is the most common use we explored so far, measuring the impact of a training point to explain the impact in a given testing point.
Adversarial Attacks: Real-world data is noisy, and it can be problematic for machine learning. Adversarial machine learning methods are methods used to feed a model with deceptive input, changing the predictions of a classifier. Influence functions can help by identifying how to modify a training point to increase the loss in a target point.
Label mismatch: Toy datasets are pretty good for experimentation, but real data might contain many mislabeled examples. The idea is to calculate the influence of a particular training point \(I(z_{i}, z_{i})\) if that point was removed. Email spam is a good example since it usually uses the user’s input in classifying whether an email is spam or not.

Conclusion

The very interesting work from [1] brought influence functions to the context of machine learning. In principle, this technique was introduced more than 40 years ago by [2]. One of the main contributions is how to apply it to non-differentiable loss functions (e.g. hinge loss). In addition to that, the paper uses other existing ideas to overcome the computation issue, such as conjugate gradients and the LiSSA algorithm. Subsequent work studied influence functions on groups [5], [6]. The latter uses second-order influence functions to capture hidden information when the group size is relatively large. I believe this is a powerful technique that will continue to inspire new ideas in many different areas. One example is pruning, where a single-shot pruning technique was based on connection sensitivity [12], exploring the idea of perturbing weights in a network. Another is in the area of graphs, where the popular JK Networks framework [13] uses perturbation analysis to measure the impact of a change in one node embedding on another node embedding.

References

[1] P. W. Koh and P. Liang, “Understanding Black-box Predictions via Influence Functions,” in Proceedings of the 34th International Conference on Machine Learning, 2017, vol. 70, pp. 1885–1894.
[2] F. R. Hampel, “The Influence Curve and Its Role in Robust Estimation,” Journal of the American Statistical Association, vol. 69, no. 346, pp. 383–393, 1974.
[3] R. A. Maronna, D. R. Martin, and V. J. Yohai, Robust Statistics: Theory and Methods. Wiley, 2006.
[4] R. D. Cook and S. Weisberg, Residuals and Influence in Regression. New York: Chapman and Hall, 1982.
[5] P. W. W. Koh, K.-S. Ang, H. Teo, and P. S. Liang, “On the Accuracy of Influence Functions for Measuring Group Effects,” in Advances in Neural Information Processing Systems, 2019, vol. 32.
[6] S. Basu, X. You, and S. Feizi, “On Second-Order Group Influence Functions for Black-Box Predictions,” in Proceedings of the 37th International Conference on Machine Learning, 2020, vol. 119, pp. 715–724.
[7] J. R. Shewchuk, “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain,” Aug. 1994.
[8] J. Martens, “Deep Learning via Hessian-Free Optimization,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, Madison, WI, USA, 2010, pp. 735–742.
[9] N. Agarwal, B. Bullins, and E. Hazan, “Second-Order Stochastic Optimization for Machine Learning in Linear Time,” Journal of Machine Learning Research, vol. 18, no. 116, pp. 1–40, 2017.
[10] H. Guo, N. Rajani, P. Hase, M. Bansal, and C. Xiong, “FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 10333–10350.
[11] S. Basu, P. Pope, and S. Feizi, “Influence Functions in Deep Learning Are Fragile,” 2021.
[12] N. Lee, T. Ajanthan, and P. Torr, “SNIP: Single-Shot Network Pruning Based on Connection Sensitivity,” 2019.
[13] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-ichi Kawarabayashi, and S. Jegelka, “Representation Learning on Graphs with Jumping Knowledge Networks,” in ICML, 2018, pp. 5449–5458.

Paper review - Design Space for Graph Neural Networks

2021-12-20T15:27:31+00:00

Introduction

Design Space for Graph Neural Networks [1] was published at NeurIPS 2020. The authors are Jiaxuan You, Zhitao Ying and Jure Leskovec from Stanford. There is also a very good video from the author available on YouTube, and the code is available on GitHub.

Instead of evaluating a specific architecture of GNNs such as GCN, GIN or GAT, the paper explores the design space in a more general way. For example, is batch normalization helpful in GNNs? The paper answers this question empirically by performing multiple experiments.

The paper takes a systematic approach to study a general design space of GNN for many different tasks, presenting three key innovations:

General GNN design space
GNN task space with a similarity metric
Design space evaluation

General GNN design space

The design space is based on three configurations: intra-layer design, inter-layer design, and learning configuration. All combined possibilities result in 314,928 different designs.

Figure 1: General design space divided into intra-layer, inter-layer and learning configuration. Image extracted from [1].

Intra-layer design follows the sequence of the modules:

\[h^{(k+1)}_{v} = AGG\Big(\Big\{ACT\Big(DROPOUT(BN(W^{(k)} h_u^{(k)} + b^{(k)}))\Big), u \in \mathcal{N}(v)\Big\}\Big)\]

It uses the following ranges:

Aggregation	Activation	Dropout	Batch Normalization
Mean, Max, Sum	ReLU, PReLU, Swish	False, 0.3, 0.6	True, False

Inter-layer design defines how the neural network layers are stacked and connected:

Layer connectivity	Pre-process layers	Message passing layers	Post-process layers
Stack, Skip-Sum, Skip-Cat	1, 2, 3	2, 4, 6, 8	1, 2, 3

Learning configuration covers the training settings:

Batch size	Learning rate	Optimizer	Training epochs
16, 32, 64	0.1, 0.01, 0.001	SGD, Adam	100, 200, 400
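Counting the options listed in the three tables reproduces the total number of designs quoted earlier:

```latex
\[
\underbrace{3 \cdot 3 \cdot 3 \cdot 2}_{\text{intra-layer}} \times
\underbrace{3 \cdot 3 \cdot 4 \cdot 3}_{\text{inter-layer}} \times
\underbrace{3 \cdot 3 \cdot 2 \cdot 3}_{\text{learning configuration}}
= 54 \times 108 \times 54 = 314{,}928
\]
```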

I believe some of the properties selected above should not be labelled as architecture (e.g. learning rate, epochs). The talk by Ameet Talwalkar addresses the difference between hyper-parameter search and neural architecture search (NAS) well. Hyperparameter search starts by assuming you have a fixed neural network backbone, and then there are certain properties that you want to tune. Some properties are architectural and others non-architectural:

Architectural: nodes per layer, number of layers, activation function
Non-architectural: regularization, learning rate, batch size

In NAS, you ignore the non-architectural parameters, and you also consider layer operations and network connections in the architectural setting. Hyperparameter search covers the entire space used to build your network, whereas neural architecture search is limited by a defined design space.

GNN task space with a similarity metric

The paper developed a technique to measure and quantify the GNN task space in conjunction with the design space. This is the most interesting idea from this paper, in my opinion, and could spawn other promising ideas. They collect 32 synthetic and real-world GNN tasks/datasets and use Kendall rank correlation [2] to compare an evaluated task to a new task. The finding is very interesting: similar tasks perform well using similar configurations, and the inverse is true. The implication is the possibility of transferring the configuration from one known task to a new task/dataset.

The example below demonstrates two different tasks, A and B. A controlled random search is applied to find the best-performing design for each task. In this example, task A performed better using the sum aggregation function, whereas task B performed better using the max aggregation function. The question is whether a configuration can be reused on a new task based on how similar the tasks are.

Table 1: Image extracted from [1]

When a new target task is introduced (ogbg-molhiv in the example), its similarity to the known tasks is calculated. Task A has a correlation of 0.47, and Task B has a negative correlation of -0.61. When applying both configurations from A and B to the new task, the performance was significantly better using the Task A design, which has a high correlation with the target task.
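The similarity score itself is simple to compute. Here is a minimal sketch of the Kendall rank correlation between two lists of scores, for instance the performance of the same set of designs on two tasks. I use the tau-a variant, which ignores ties, for simplicity; the exact variant and inputs used in the paper may differ, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kendall rank correlation (tau-a) between two score lists of equal length:
// (concordant pairs - discordant pairs) / (total number of pairs).
// Tied pairs count as neither concordant nor discordant.
double kendallTau(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size() && a.size() >= 2);
    const std::size_t n = a.size();
    long long concordant = 0;
    long long discordant = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            const double s = (a[i] - a[j]) * (b[i] - b[j]);
            if (s > 0.0) {
                ++concordant;
            } else if (s < 0.0) {
                ++discordant;
            }
        }
    }
    const double pairs = static_cast<double>(n) * static_cast<double>(n - 1) / 2.0;
    return static_cast<double>(concordant - discordant) / pairs;
}
```

Identical orderings give 1, fully reversed orderings give -1, and values near 0 mean that the ranking of designs on one task says little about the other.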

Design space evaluation

Evaluating the design space across all the tasks leads to over 10 million possible combinations (314,928 designs × 32 tasks = 10,077,696). A controlled random search is proposed to explore this space: 96 setups are randomly sampled out of the 10M possibilities, and only the configuration under study is varied. For example, consider batch normalization as the target of study. A sample of 96 different configurations is randomly drawn from the design space. Batch normalization is set to True and each configuration is evaluated. Keeping the other parameters fixed, batch normalization is set to False and the configurations are evaluated again. The results are ranked by performance to generate a distribution, and the frequency is used to analyze whether batch normalization is generally helpful or not.

Experiments and Results

The paper shows a nice visualization of the experiments using violin plots.

Figure 3: Boxplot of the results. Image extracted from [1]

Each plot represents the distribution of the rank. For example, the first graph is the distribution of the experiments for batch normalization. Across randomly sampled architectures, setting batch normalization to True ranked better (lower is better), indicating that in most cases, the GNN will perform better when this property is used. The main findings of the paper are:

Dropout node feature is not effective.
PReLU stands out as the choice of activation.
Sum aggregation is the most expressive.
There is no definitive conclusion for the number of message passing layers, pre-processing layers or post-processing layers.
Skip connections are generally favorable.
A batch size of 32 is a safer choice, as is a learning rate of 0.01.
Adam resulted in better performance than SGD.
More epochs of training lead to better performance.

References

[1]J. You, R. Ying, and J. Leskovec, “Design Space for Graph Neural Networks,” 2020.
[2]H. Abdi, The kendall rank correlation coefficient. Encyclopedia of Measurement and Statistics., 2007.

What is this about

2021-10-16T17:27:31+00:00

I have lately dedicated a good amount of my time to input but little to generating output. Writing about what you learn is an efficient way to ask yourself if you really know what you are supposed to know. Furthermore, it is very challenging to write clearly and concisely, and I hope I can use this blog to improve my brain’s model parameters to write better. I will mainly use the posts as annotations, probably editing and adding more information as I learn. The main topic is machine learning focused on graphs, where I have been dedicating most of my time. Feel free to contact me at rui.david ontariotechu.net.