How to train your Lipschitz Network

What are Lipschitz Networks?

In my endeavours to make AI safer and more understandable, I make use of neural networks which I call “Lipschitz networks” (even though this term is not used consistently in the literature).

Lipschitz networks constrain the gradient p-norm $|\nabla_x f(x)|_p$ of the network with respect to its inputs to a maximum of your choice, let's say 1. In practice, there are multiple ways to do this. The requirements for a suitable implementation are as follows:

  1. $|\nabla_x f(x)|_p \leq 1$.
  2. Can approximate any $f(x)$ that itself fulfills 1. to arbitrary precision
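
Requirement 1 is easy to check empirically with automatic differentiation. A minimal sketch (assuming PyTorch; input_grad_norm is a hypothetical helper and f is any network with one output per sample):

import torch

def input_grad_norm(f, x, p=2):
    # per-sample p-norm of the gradient of f with respect to its input
    x = x.detach().clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(f(x).sum(), x)
    return grad.norm(p=p, dim=-1)  # should be bounded by 1 everywhere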

The Layerwise Constraint

One very safe and deterministic way of implementing requirement 1 is to constrain the operator norm of each layer's Jacobian with respect to its input. In fully connected networks, the Jacobian of a linear layer is just its weight matrix, so it is convenient to constrain the weights directly, layerwise: $|W^i|_p \leq 1 \ \forall i$
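
In practice this can be enforced, for example, by rescaling the weights after every optimizer step (a minimal sketch, assuming PyTorch; rescaling is just one of several possible schemes):

import torch

@torch.no_grad()
def project_layerwise(model):
    # enforce |W|_inf <= 1 for every linear layer by rescaling
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            # infinity-operator norm of a matrix: maximum absolute row sum
            norm = m.weight.abs().sum(dim=1).max()
            m.weight.div_(torch.clamp(norm, min=1.0))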

However, a layerwise constraint can overdo the trick. Since the Lipschitz constant of a (fully connected) neural network is bounded by the product of the layer Jacobian norms and the Lipschitz constants of the activations, a layerwise constraint easily accumulates into an effective bound much smaller than 1 and cannot recover the “full gradient”. (Anil et al., 2019) refer to this as gradient norm attenuation.

The GroupSort activation

The specific problem is that the usual activation functions, while being Lipschitz-1, cannot maintain the maximum allowed gradient everywhere. For instance, if one neuron has a preactivation of $< 0$, ReLU will produce a gradient of 0 there, and $| \nabla_x f(x) | = 1$ becomes unachievable. (Anil et al., 2019) show this very nicely by trying to fit a layerwise constrained network with ReLU activation to the absolute value function. Spoiler: It does not work.

So they went ahead and derived a new activation function: GroupSort. It sorts the preactivations within n subgroups of the input. Example: GroupSort(1) is the full sort operation, GroupSort(d/2) is the MaxMin operation. Since it merely permutes its inputs, it maintains gradient norm 1 everywhere while still being a sufficient nonlinearity to serve as an activation. Together with a specific constraint, they are able to prove universal approximation of GroupSort Lipschitz networks!
The weight norm constraint to achieve p-normed Lipschitzness is:

\[\begin{align} |W^1|_{p,\infty} &\leq 1 \\ |W^i|_\infty &\leq 1 \ \ \forall \ i > 1 \end{align}\]
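
A minimal MaxMin, i.e. GroupSort with groups of size two, might look like this in PyTorch (a sketch; pairing feature i with feature i + d/2 is one common choice):

import torch

class MaxMin(torch.nn.Module):
    # GroupSort with group size 2: sort pairs of preactivations
    def forward(self, x):
        a, b = x.chunk(2, dim=-1)
        return torch.cat([torch.maximum(a, b), torch.minimum(a, b)], dim=-1)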

Training the Lipschitz Network

Ok, so let us train a Lipschitz network for some binary classification task! For training data, let's use the two-moons dataset. Using BCE as loss and Adam as optimizer, we can train a Lipschitz network with a Lipschitz constant of 1. We immediately see that the network is unable to achieve good classification performance.
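
A rough sketch of this experiment, reusing the MaxMin and project_layerwise helpers from above (assuming PyTorch and scikit-learn; widths, learning rate and step count are placeholders, and BCEWithLogitsLoss is BCE with the sigmoid folded in):

import torch
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.05)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)

model = torch.nn.Sequential(
    torch.nn.Linear(2, 64), MaxMin(),
    torch.nn.Linear(64, 64), MaxMin(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.BCEWithLogitsLoss()

for step in range(2000):
    opt.zero_grad()
    loss_fn(model(X).squeeze(-1), y).backward()
    opt.step()
    project_layerwise(model)  # keep every layer Lipschitz-constrained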

[Figure: two-moons decision boundary obtained with BCE loss]

The reason for this is not that the gradient is too constrained. In fact, we should be able to achieve perfect classification performance with any Lipschitz constant > 0, because the decision frontier is defined only by the sign of the output (or the sign of output - 0.5 if it’s in [0,1]), and the sign is scale invariant. So why does this not work?

Recall how BCE works: It tries to maximize the margins, i.e. push the output for class 0 as close as possible to 0 and the output for class 1 as close as possible to 1. With a sigmoid as output activation, this means the preactivations must grow without bound to minimize BCE.
In unconstrained networks, that is fine and actually desirable. In a Lipschitz network, it is inadvisable to concentrate on margin maximization because, when the gradient is bounded, that objective may clash with the actual goal of classification: maximizing accuracy. A loss function that cares about margin maximization only up to a certain point is much better suited here: hinge loss!
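
With labels mapped to ±1, hinge loss is essentially a one-liner (a sketch; the margin of 1 is a free choice):

import torch

def hinge_loss(output, y, margin=1.0):
    # y in {-1, +1}: zero penalty once y * output clears the margin
    return torch.clamp(margin - y * output, min=0).mean()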

[Figure: two-moons decision boundary obtained with hinge loss]

Hinge loss does not assign a penalty to training data outside of a margin of specified size, so the Lipschitz network can concentrate its efforts on optimizing the decision frontier.

More on this when I find more time.

References

  1. Anil, C., Lucas, J., & Grosse, R. (2019). Sorting out Lipschitz function approximation.
SSHing into a VM hosted on a remote machine, i.e. how to use VSCode with a remotely hosted VM

Did you ever want to develop software in a VM that is hosted on a remote machine?
If so, you probably didn't consider this a problem when using an editor like vim, which you install right on site. However, I try to use VSCode with the Remote extension, which expects to be able to connect directly via some SSH config to the target server.
Specifically, my setup is as follows: I host a VM with Vagrant (called Y) on a remote server (called X) and want to use VSCode on my local machine to connect to Y.

Now, Vagrant in its default setting exposes the VM's SSH port on localhost:2222, which you can then ssh into via $ vagrant ssh. With $ vagrant ssh-config you will get the SSH config used to connect to it. It looks something like this, and it works well from the remote machine:

Host default
  HostName 127.0.0.1
  User vagrant
  Port 2222
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  PasswordAuthentication no
  IdentityFile /path/to/some/generated/private_key
  IdentitiesOnly yes
  LogLevel FATAL

By pasting it into your ~/.ssh/config you can then use $ ssh default to connect to it. However, this only works on the remote machine where you set the VM up. In order to connect to it from the outside, you need to somehow connect to the remote machine X at port 2222, which is likely closed to the outside. The solution to this problem is a ProxyJump. ProxyJumps connect you to a remote machine via an intermediate “gateway” machine. The only unusual thing here is that the gateway machine is the same as the target, only a different port.

So I tried something like this. Notice that I copied the private key to my local machine.

# On local machine

Host X
  HostName X.com
  User nnolte
  ...

Host default
  HostName X.com
  User vagrant
  Port 2222
  ProxyJump X
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  IdentityFile path/to/copied/private_key

This did not work; it gave me a connection refused, just like when I tried connecting directly to port 2222. I am no expert, but presumably this is because Vagrant binds the forwarded port only to the loopback interface, so it is not reachable under the machine's public address, not even from X itself. Funnily enough, swapping HostName X.com for HostName localhost turns out to be the solution:

# On local machine

Host X
  HostName X.com
  User nnolte
  ...

Host default
  HostName localhost # NOT X.com
  User vagrant
  Port 2222
  ProxyJump X
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  IdentityFile path/to/copied/private_key

✨✨✨✨
Cool, this worked. I am no expert on SSH config, but I did not expect localhost to be interpreted “relative” to the ProxyJump.

With this setup, you can connect to default from your local machine and use VSCode with the Remote extension as usual.

Composing functions in Python

Learning a functional language is a very enjoyable experience. Haskell, in my case, is very different from imperative languages like C++ and Python. Even a “simple” thing like IO suddenly becomes a difficult piece of functionality. On the other hand, Haskell has some features that I miss elsewhere, like lazy evaluation or the way one can natively bind function arguments and compose functions.

Well, function composition is something that seems achievable with Python:

def f(x):
  return x - 2

def g(x):
  return 2 * x

x = 7

# now I want (g . f)
g(f(x)) # applied g on the output of f

What if I want to only compose, without immediate application? I just want the function that can be represented by h = (g . f)

x = list(range(10))

h = lambda x : g(f(x))

list(map(h, x))

Ok, that works fine. What if I wanted to have the incredibly convenient syntax that Haskell has? What we need to do then is override a binary operator to compose. How about __mul__? The problem: Builtins cannot be extended.

def compose(g,f):
  return lambda x : g(f(x))

type(f).__mul__ = compose
# TypeError: can't set attributes of built-in/extension type 'function'

Fortunately for us, clarete hacked around in the CPython internals to make builtin extensions possible directly from Python: forbiddenfruit. Whether or not that is a good idea, who knows?

from forbiddenfruit import curse

def f(x):
  return x - 2

def g(x):
  return 2 * x

curse(type(f), '__mul__', compose)

x = list(range(10))

list(map(g*f, x))
# [-4, -2, 0, 2, 4, 6, 8, 10, 12, 14]

One can also compose more than two functions, h*g*f, or adjust the compose function to take *args or **kwargs, for example:
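
A sketch of the variadic variant (reusing curse from above; only the innermost function ever sees the original arguments):

def compose(g, f):
    return lambda *args, **kwargs: g(f(*args, **kwargs))

curse(type(compose), '__mul__', compose)

h = (lambda x: x + 1) * (lambda x, y: x * y)
h(3, 4)
# 13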

What can you do with extended builtins?

All combinations of types in a tuple in C++

What?

We have a tuple of types

#include <tuple>

template<typename ...  Ts>
using t=std::tuple<Ts...>;

struct a{};
struct b{};
struct c{};

using my_tuple = t<a,b,c>;

and we would like to get all possible type combinations of length n, taken from this tuple.
That corresponds to the n-fold Cartesian product of the tuple with itself. So, my result should look like this:

combinations<my_tuple, 2> // returns t<t<t<a,a>, t<a,b>, t<a,c>>,
                          //           t<t<b,a>, t<b,b>, t<b,c>>,
                          //           t<t<c,a>, t<c,b>, t<c,c>>>

Combinatorics in Python

I like to prototype algorithms in Python first and then translate: less fiddling with details.
One possible solution to do the combinatorics looks like this:

def combinations(arr, n, res=[]): 
    if n == 0: 
        return res 
    return [combinations(arr, n-1, res+[i]) for i in arr]

It will recursively call combinations, “keeping track” of the current elements by appending them to the result, and then return once we have reached the desired dimension.
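
For example, for two elements and n = 2, the nesting mirrors the C++ result we are after:

combinations(['a', 'b'], 2)
# [[['a', 'a'], ['a', 'b']], [['b', 'a'], ['b', 'b']]]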

Now with C++ types

Check all the types

There is a neat trick for checking which type you are currently fiddling with.
Declare some type that holds your type of interest and do not define it,
then gcc and clang give you a nice error if you try to instantiate one of these bad boys,
displaying your type nicely:

template<typename ... Ts>
struct type_printer;

int main () {
  type_printer<my_tuple>{};
}

in gcc-9.1 gives

<source>: In function 'int main()':
<source>:47:37: error: invalid use of incomplete type 'struct type_printer<std::tuple<a, b, c> >'
   47 |     type_printer<std::tuple<a,b,c>>{};
      |                                     ^
<source>:5:8: note: declaration of 'struct type_printer<std::tuple<a, b, c> >'
    5 | struct type_printer;
      |        ^~~~~~~~~~~~

Recurse in the type system

Recursion works fairly straightforwardly in the C++ type system.
You can see that in many parts of the STL and everywhere on StackOverflow.
Remember, we need something that refers to itself and some stopping condition.
A small example of recursion is something along the lines of std::make_index_sequence:


template<std::size_t ... Is>
struct index_sequence{};

//result... carries the ascending pack of integers
template<std::size_t n, std::size_t ... result>
struct make_index_sequence {
    //every time we iterate, we append n-1 to the result.
    using type = typename make_index_sequence<n-1, n-1, result...>::type;

};

//stopping condition: we will not continue if we reached 0
template<std::size_t ... result>
struct make_index_sequence<0, result...> {
    using type = index_sequence<result...>;
};

Some helpers

To concatenate and append to tuple types, we use these little helpers, making use of std::tuple_cat to determine the type:

template <typename... tups>
using tuple_cat_t = decltype(std::tuple_cat(std::declval<tups>()...));

template <typename tup, typename item>
using append = tuple_cat_t<tup, std::tuple<item>>;

We will also need to “iterate over tuples”, which is normally done via index sequences, therefore we define an index sequence with the length of a tuple:

template <typename tup>
using index_sequence_for_tuple =
    std::make_index_sequence<std::tuple_size_v<tup>>;

Element-wise tuple transformations

Now, we need a helper to execute one operation on each entry of a tuple and “return” a transformed tuple, very similar to boost::hana::transform

template <typename tup,
          template <typename> typename op,
          std::size_t... Is>
auto operate_t_impl(std::index_sequence<Is...>)
    -> std::tuple<op<std::tuple_element_t<Is, tup>>...>;

template <typename tup, template <typename> typename op>
using operate_t = decltype(
    operate_t_impl<tup, op>(std::declval<index_sequence_for_tuple<tup>>()));

The usual way to get a parameter pack of the types from a tuple is

std::tuple_element_t<Is, tup>...

Is is a parameter pack of the indices you want to gather the types from, so in our case all of them 0,1,2,3,4....
That is the reason for the existence of the helper index_sequence_for_tuple.
Since the index_sequence is not itself a parameter pack, which is what we need for the tuple iteration,
we use a common trick involving function template argument deduction in operate_t_impl.
To get the std::size_t ... Is from our index_sequence, we pass (a std::declval of) the sequence as a function argument and let template argument deduction deduce std::size_t ... Is for us.

Ok, so now we can invoke “unary operations” (type transformations) with a signature template <typename> typename op on all elements of the tuple, and “return” a result tuple. So to say, we just implemented a poor man's hana::transform.

Bring stuff together

Now that we can perform elementwise transformations on a tuple and recurse, let's bring it together to perform our task:

template <typename tup, std::size_t n, typename result = std::tuple<>>
struct combinations {
  //this "operation" is conceptually similar to a unary lambda given in std::transform
  //its python equivalent: combinations(arr, n-1, res+[i])
  template <typename item>
  using operation = typename ::combinations<tup, n - 1, append<result, item>>::type;

  //this "loops" over the tuple, each time invoking operation, which takes care of the recursion
  //its python equivalent: [operation for i in arr]
  using type = operate_t<tup, operation>;
};

and the partial template specialization corresponding to the stopping condition:

//its python equivalent: 
//    if n == 0: 
//        return res 
template <typename tup, typename result>
struct combinations<tup, 0, result> { 
  using type = result;
};

That's it, much less code than I would have expected when starting this exercise. :D See the example on godbolt.
