Patch Theory
Introduction
Versioning systems have been important tools in software development (and other areas) for a long time. In recent years, git has been the most successful one with github as the place to be for open source projects.
In the first part of this article I want to develop a sound mathematical theory of version control. I will propose definitions of states, changes, patches and merges.
While git was created out of the need for proper versioning within the linux project, darcs is a versioning system that is based on science and a theory of patches. While my theory won't be based on git, it will most likely be influenced by my daily work with it. So in the second part I will try to reflect on that and compare my theory to the one that darcs is based on.
My focus will of course be what mathematicians like best: Creating theories out of thin air and proving the existence of solutions without ever thinking about the possibility to get to this solution algorithmically in a decent amount of time. So in the third part I will spend a little time thinking about the possibility to implement any of it.
I will not give an introduction to how version control works today. There are several really good introductions on the net.
I am not sure what good this article will bring. At the very least, I expect writing it to be fun. But I also hope that a deeper understanding of version control will make it easier to work with it. In the very best (and least likely) case, I will find something that truly improves version control for everyone, like a way to eliminate the need for manual merges altogether or making git understand that I simply changed indentation.
The Basics
Say we have a set S
of states. We always use capital letters for states. For two states A, B ∈ S
we say (A, B)
or shorter AB
is the change from state A
to state B
and the elements of C := S²
are called changes.
S
is in this abstract definition just any set. But in practice it would more likely contain each and every point in history of each and every software project there is, has been, and ever will be.
Note that there is some kind of natural composition of changes: When I first change from A
to B
, and then from B
to C
, I have changed from A
to C
(in formula: AB * BC = AC
).
Patches
Now imagine you have a repo with some files, one of them called "main.py". Then you delete this file. This, in the terms we already defined, is simply a change from one state to another.
Now imagine you had not deleted that file, change something in a different file and went on to delete "main.py" then. This is also a change from on state to another, but they are different states than in the previous example. So although we did the same thing (delete "main.py") we have two completely different changes.
The mathematical tool for saying "these things are different, but in the current context they are really just the same" are equivalence relations. So what we need to do to get from changes to patches is to define an equivalence relation on the set of changes.
Then again, I would much prefer an axiomatic approach over actually defining that relation in every detail. Too strict definitions always carry the danger of incompatibility with related theories.
The Equivalence Relation
I propose the following axioms for the equivalence relation ~
:
AB ~ CD => BA ~ DC
AB ~ AC => B = C
AB ~ A'B' and BC ~ B'C' => AC ~ A'C'
AB ~ CD => AC ~ BD
The elements of P := C/~
are called patches. We denote patches by lowercase letters.
The first axiom simply says that when two changes are equivalent, their inverts must be equivalent, too.
The second axiom states that, if a patch can be applied to a state, the result is unambiguous.
The third axiom tells us that patches are compatible with composition.
The fourth axiom is an interesting one, mostly because it does not convey its true meaning easily. Think of it this way: We start at state A
and go to state D
, but we have two different paths. The one takes a detour over B
while the other goes over C
.
Say that AB
and CD
are equivalent. If you think in terms of line based patches, this is only possible if AC
does not touch any of the lines that CD
changes. And because in the end, both paths arrive at D
, this must mean that AC
and CD
are equivalent.
Corollaries
For 1, 2 and 4 there is actually equivalence
1 ensures that for every patch
a
there is an inverse patcha⁻¹
AA ~ BC <=> B = C
follows from 2 and 4 and shows that there is an identity patch we callid
.
Applying a Patch
We can interpret a patch a
as a function (which is compatible to the set theoretic definition of a function) with the following domain and codomain:
dom(a) := {X | ∃ Y ∈ S: XY ∈ a}
cod(a) := {Y | ∃ X ∈ S: XY ∈ a}
We say, a patch a
is applicable to a state A
, iff A ∈ dom(A)
.
Composing patches
Up until now we have a notion of what a single patch is. We know how to apply it where it is applicable.
But a repo consists of dozens, sometime thousands of patches. So it is important to think about how multiple patches can be composed.
As patches, as we have defined them, are in fact functions, the first guess might be to use function composition: If a, b ∈ P
and cod(a) ⊆ dom(b)
, we can write b * a
.
The problem with this is that we can not be sure that b * a
is a patch. Axiom 2 makes sure that all changes in that set are equivalent. But we do not know whether elements of the equivalence class are missing. Take for example a⁻¹ * a
. This should equal id
, but in fact it is id
restricted to dom(a)
.
There is a second problem. What do we do if the condition cod(a) ⊆ dom(b)
is not complied with? We could simply say that these patches conflict, but it is an important goal to create as few conflicts as possible.
So I prefer this definition: For two patches a, b
and X, Y, Z ∈ S
such that XY ∈ a
and YZ ∈ b
, the composition of a
and b
is defined as ab := [XZ]
.
The result is always a patch
It works in many cases where
cod(a) ⊆ dom(b)
is not complied withIt only creates a conflict if
cod(a) ∩ dom(b)
is empty
Independence
Axiom 4 can now be reformulated in terms of patches:
ab = bc => a = c
Patches a, b ∈ P
for which ab = ba
is true are called independent. id
is independent from all patches.
Further Ideas
These are notes on ideas on some parts that may be interesting to work on in the future.
Simple Questions
There are some questions about patches that seem harmless but are not all that easy to get an answer for. Examples:
Is there a patch
a
withdom(a) = S
except forid
?Is there a patch
a
withdom(a) ∩ cod(a) != 0
except forid
?
Patch sets
The composition I gave in the previous chapter is not associative. Example: Say patches a, b
conflict. Then (ab)b⁻¹
is not defined, while a(bb⁻¹)
is.
There is an obvious solution to this problem: Patches should simply always be applied in an order that makes most sense. That is, we do not look at (ordered) sequences of patches, but at patch sets. A merge, in this scenario is simply a union of two patch sets.
But there are obvious issues with this approach, too. It is not clear what the "order that makes most sense" should be and whether it is unambiguous. And while most things I said before are compatible with common version control systems we use today, this is not.
State as Patches
Now that we have patches we can fix one state we may call 0
and identify every state A ∈ S
with the patch [0A]
where 0
is identified with [00] = id
.
Of course this is only a small subset of patches (more specifically, it is {a ∈ P | 0 ∈ dom(a)}
). But it may have some interesting structural properties.
Conflicts
Group structure
The set of patches P
has a structure similar to a group: There is a neutral element id
and for every element a
, there is an inverse element a⁻¹
with aa⁻¹ = id = a⁻¹a
.
But unfortunately the composition is only defined for patches a, b ∈ P
that comply with the condition cod(a) ∩ dom(b) != 0
. The theory would get a whole lot nicer if we could manage to extend P
to contain one or many elements that represent conflict. I did not come up with a sensible way to force P
into a group, but I guess it would be worth it.
Manual conflict resolution
An issue that I imagine is very important in developing actual version control systems is aiding humans in resolving conflicts. I did only scratch that topic when I said that I wanted to provoke as few conflicts as possible.
But sometimes there are real conflicts and in those cases it is no good trying to be smart. So breaking a patch into multiple parts, applying those that can be applied automatically and presenting the other to a human in a humane way is something that needs to be done.
I did not yet think about it at all.
Conclusion
I believe I have come up with a stable foundation for patch theory. The idea to define patches as equivalence classes over the square of states proved to be flexible and elegant. I have analyzed patches to have a group-like structure, but did not yet manage to force them into actually being a group.
I guess my thoughts are still a bit messy. So sorry for that.