Ruby Module Linearization

This is the second article in the Ruby's Dark Corners series.

Mixin Modules

In Ruby, classes can only inherit from a single class. However they can include multiple modules.

A class that includes a module will be able to use the methods defined in the modules. A class doesn't really "inherit instance variables". Instance variables must not be declared, they exist just by virtue of being accessed. This means that if a class and its superclass (or an included module) both access a variable @x, they will access the same variable.

This is because of late binding (aka dynamic dispatch): both instance variables and methods will be looked up from self at run-time.

While here is only a single copy of each instance variable, there could be multiple method declarations with the same name. How does Ruby select which one to use?

The answer is a process called linearization, which defines how Ruby orders all the (transitively) inherited classes and included modules. You can inspect this ordering through the Module#ancestors method. For instance try Integer.ancestors.

More than lookup, linearization also controls the behaviour of calls to super methods through super(). Ruby reorganizes all modules and classes into a single inheritance chain (the ancestor chain) and consecutive calls to super() move up this chain.

Confusing Examples

But how does linearization work, exactly?

Here are a couple scenarios, what do you think the behaviour for each of this scenario is? Each is tailored to show something about the linearization process.

# Scenario 1

module A1; end
module B1; end
module C1
  include A1
  include B1
end

# C1.ancestors    => [C1, B1, A1]
# Scenario 2

module A2; end
module B2; end
module C2
  include A2
end
module D2
  include B2
end
module E2
  include C2
  include D2
end

# E2.ancestors    => [E2, D2, B2, C2, A2] 
# Scenario 3

module A3; end
module C3
  include A3
end
module D3
  include A3
end
module E3
  include C3
  include D3
end

# E3.ancestors    => [E3, D3, C3, A3]

So far, so good.

# Scenario 4

module A4; end
module B4; end
module F4; end
module C4
  include B4
  include A4
end
module D4
  include F4
  include A4
end
module E4
  include C4
  include D4
end

# E4.ancestors    => [E4, D4, C4, A4, F4, B4]
# Scenario 5

module A5; end
module B5; end
module C5
  include B5
  include A5
end
module D5
  include A5
  include B5
end
module E5
  include C5
  include D5
end

# E5.ancestors    => [E5, D5, C5, A5, B5]

Pretty surprising, isn't it?

Two things further complicate the situation:

The Algorithm

The linearization algorithm is very hard to describe precisely with a few sentences, so I'll just serve you the code:

Include = Struct.new :mod
Prepend = Struct.new :mod

# Returns the ancestry chain of `mod`, given the environment `env`.
#
# No distinctions are made between classes and modules: where a class extends
# another class, that class is treated as the first included module.
#
# params:
# - mod: Symbol
#   Represents a module.
# - env: Map<Symbol, Array<Include|Prepend>>
#   Maps modules to inclusions, in order of apperance.

def ancestors (mod, env)
  chain = [mod]
  includes = env[mod]
  includes.each { |it| insert(mod, it, env, chain) }
  chain
end


def insert (mod, inclusion, env, chain)
  i = inclusion.is_a?(Prepend) ? 0 : 1
  ancestors(inclusion.mod, env).each do |it|
    raise ArgumentError('cyclic include detected') if it == mod
    j = chain.find_index it
    if not j.nil?
      i = j + 1 if j >= i
      next
    end
    chain.insert(i, it)
    i += 1
  end
end

Intuition

Let's try to get some intuition in there. Let's start with some observations:

The last point highlight the most unintuitive thing about the algorithm: modules included later take precedence, but when considering conflicts of ordering, it's modules included earlier that win!

The reason is that in Ruby, we may include a module in another at any time. Ruby never reorders modules in the ancestor chain, but has no problem inserting modules in between existing modules. This does make sense, but leads to puzzling linearization behaviour.

Making module included earlier take precedence would solve the problem. However when we include a module in another at runtime, we usually would like it to take precedence on previously included modules!

Inheritance vs Inclusion

As mentionned in the comment, there is no difference between inheritance and inclusion with regards to linearization. An inherited class is treated as though it was the first included module.

Merging Ancestor Chains

If you look at the algorithm, you will see that it is a recursive process: to include a module, you need to acquire its ancestor chain, then merge it with the current ancestor chain of the includer.

The way this is done — for regular includes — is to start just after the class then insert the modules one by one. We only diverge from this if a module is already present in the ancestor chain. Either it is further ahead in the chain: then we skip to that position and continue the process; or it is earlier in the chain (signifying there is an ordering conflict between included modules) and we remain at the current position (in both cases the module is not re-inserted).

Include vs Prepend

The only difference between include and prepend is that prepend will try merging the ancestor chain of its parameter before the class in the class own's ancestor chain:

module M; end
class C; prepend M; end
C.ancestors => [M, C, ...]

Doing Better?

Despite its idiosyncracies, I would be hard pressed to improve on the algorithm without losing some of its nice properties. The key tension is that later calls to include should take precedence (to enable effective run-time meta-programming), but re-ordering modules is objectively bad.

If you have any idea, make sure to share them!

It should also be said that, of course, scenarios such as ordering conflicts rarely happen in practice. Most people ignore these rules and are not really worse off because of it.