It's been a while since I've written about any data structures. But it just so happens that I'm actually really working on implementing a really interesting and broadly useful data structure now, called a Rope.
A bit of background, to lead in. I've got this love-hate relationship with
some of the development tools that Rob Pike has built. (Rob is one of the Unix
guys from Bell labs, and was one of the principal people involved in both the
Plan9 and Inferno operating systems.) Rob has implemented some amazing
development tools. The two that fascinate me were called Sam and Acme. The
best and worst features of both are a sort of extreme elegant minimalism.
There's no bloat in Rob's tools, no eye-candy, no redundancy. They're built to
do a job, and do it well - but not to do any more than their intended job.
(This can be contrasted against Emacs, which is a text editor that's grown
into a virtual operating system.) The positive side of this is that they're
incredibly effect, and they demonstrate just how simple a programmers text
editor should be. I've never used another tool that is more effective than
Acme or Sam. In all seriousness, I can do more of my work more easily in Sam
than I can in Emacs (which is my everyday editor). But on the other hand, that
extreme minimalist aesthetic has the effect of strictly eliminating any
overlaps: there's one way to do things, and if you don't like it, tough. In
the case of Acme and Sam, that meant that you used the mouse for damn-near
everything. You couldn't even use the up and down arrows to move the cursor!
This is a non-starter for me. As I've mentioned once or twice, I've got terrible
RSI in my wrists. I can't use the mouse that much. I like to keep my hands on my
keyboard. I don't mind using the mouse when it's appropriate, but moving my hand from
the keyboard to the mouse every time I want to move the cursor?. No. No
damned way. Just writing this much of this post, I would have had to go back and
forth between the keyboard and mouse over 50 times. (I was counting, but gave up when
I it 50.) A full day of that, and I'd be in serious pain.
I recently got reminded of Acme, because my new project at work involves using a
programming language developed by Rob Pike. And Acme would really be incredibly
useful for my new project. But I can't use it. So I decided to bite the bullet, and
use my free time to put together an Acme-like tool. (Most of the pieces that you need
for a prototype of a tool like that are available as open-source components, so it's
just a matter of assembling them. Still a very non-trivial task, but a
possible one.)
Which finally, leads us back to today's data structure. The fundamental piece of
a text editor is the data structure that you use to represent the text that you're
editing. For simplicity, I'm going to use Emacs terminology, and refer to the
editable contents of a file as a Buffer.
How do you represent a buffer?
As usual with data structures, you start by asking: What do I need it to do? What performance characteristics are important?
In the case of a text buffer, you can get by with a fairly small set
of basic operations:
- Fast concatenation: concatenating blocks of text needs to be
really fast. - Fast insert: given a point in a block
of text, you need to be able to quickly insert text
at that point. - Fast delete: given two points in a block of
text, you need to be able to quickly remove the text between those points. - Reasonably fast traversal: Lots of algorithms,
ranging from printing out the text to searching it are based
on linear traversals of the contents. This doesn't have to be
incredibly fast - it is an intrinsically linear process, and it's
usually done in the context of something with a non-trivial cost
(I/O, regular-expression scanning). But you can't afford to make it
expensive. - Size: you need to be able to store effectively unlimited
amounts of text, without significant performance degradation in the
operations described above.
Considering those requirements, you can eliminate a number of obvious
alternatives:
- Simple character array: Inserts and deletes
are very expensive for large documents. - Lists of lines: if you use a linked list, the memory cost
for pointers becomes very expensive. If you use an array of
lines, copy-insert costs of new lines becomes very expensive
for large documents. - Gap buffers: a neat variation on character arrays; they have
the same problem on large documents.
Hybrids of the above can work nicely. Linked lists of character arrays
can work very nicely. In fact, the structure that I'm going to describe
is really a variation on the list of character arrays.
What works extremely well is to take the basic idea of a list of
mini-buffers (that is, smallish character arrays), and tweaking it,
so that instead of keeping a list of sub-buffers, you put
the sub-buffers into a binary tree. The result is a structure called
a rope. To the best of my knowledge, ropes were invented by
Hans Boehm (of garbage collection fame), and described in
this paper. Ropes are
a simple, elegant data structure that's excellent for representing large string-like
data structures. (The name is a pun: a real-world rope is made by twining together lots of smaller strings. A data-structure rope tangles together data-structure strings.)
I'm not going to present the full-blown version of ropes in this post; instead, I'm going to work up to that. In this post, I'm going to start by showing you a simplified version that illustrates the basic ideas. In later posts,
I'll work up to full-blown ropes, including things like sub-rope sharing, copy-avoidance via copy-on-write, and so on.
The basic idea of a rope is amazingly simple. A rope is binary tree
with strings in the leaves. An internal node in a rope tree doesn't contain any text - just pointers to its left and right children. The text value of a rope is the concatenation of its leaves, in left to right order. (This basic structure is also called a concatenation tree.) An example of a rope is shown to the side.
So - if you've got two huge strings of text that you want to concatenate, you just create a new root node, and put the two strings in as children. Presto! You've got a new rope containing the concatenation! For example, below are two rope-trees - one for the text "abcdefghijkl", and one for the text "zyxwv". Beneath them is the result of concatenating the two ropes.
Concatenation is therefore constant time, regardless of the length of the strings
being concatented. - you can't ask for better than that! Traversing the text is also
really easy - you just do an in-order traversal of the rope-tree. In the worst case -
a degenerate tree with one character per node - that gets very expensive. But it's
easy to avoid the degenerate case, as we'll see in a later post. In the typical case,
where each leaf node has a reasonable amount of text, and the tree is reasonably
balanced, the additional cost of traversing the tree is negligible in comparison to
the cost of traversing the characters in the nodes. (Each pointer in the tree needs
to be traversed once, which makes the total cost O(n) in the length of the text. But
since there's typically a lot of text per leaf, that ends up meaning that the
increase in traversal cost is minimal.)
For string-mutation operations, we can define almost anything we want in terms of
two primitives: concatenate, and subrope. For example, we can define how to
do insert and deletion of chunks of text from a rope as follows:
- Remove text: to remove a chunk of text that runs from point P1 to point
P2 in a rope, we take the subropes (0, P1) and (P2+1,end), and then
we concatenate them. The result is the rope with the chunk removed. - Insert text: to insert a chunk of text C into a rope at a particular
point p, we take the pre-subrope from 0 to P, and the post-subrope
from P+1 to the end. Then we concatenate pre-subrope, C, and
post-subrope.
So to provide those operations, we just need to figure out how to do substring in
an efficient way.
Substring isn't trivial in ropes. But it's not awful either. The main trick is
that there are a lot of cases to consider to get an implementation correct.
The basic idea is fairly simple. You traverse the tree, keeping track
of your position in the traversal. When you get to the substring you
want, you start copying - usually just copying the pointers to nodes. You
keep copying until you get to the end of the desired substring. The catch,
as I said above, is keeping track of all of the cases.
An outline of substring is:
- Inputs:
- A rope, R.
- The start position of the desired sub-rope, Start.
- The length of the desired sub-rope, L.
- Algorithm:
- Let currentPosition = 0
- Let remainingLengthToCopy = L
- Let result = EmptyRope
- For each node N in the tree:
- If N is an internal node: traverse the left subtree, and then the right
subtree. - If N is a leaf, then:
- The start position of this node is currentPosition;
the end position of this node is currentPosition plus the
length of the string in this node. - If the start position of the desired subrope is
beyond the end of this leaf, then just increment
currentPosition by the size of the string in this leaf. - If the start position of the desired subrope is inside
of this leaf, then start copying - produce a new
rope node containing the substring (There are sub-cases
here for "subrope also ends in this node", and "subrope"
doesn't end in this node.) Increment currentPosition by
the size of the string in this node, and decrement
remainingLengthToCopy by the size of the copied substring. - If the start position of the desired subrope is before this
node, and the end position of the desired subrope is after
this node, then insert this node into the result rope,
increment the position by the size of the string in this node,
and decrement remainingLengthToCopy by the length of the string
in this node. - If the start position of the subrope is before this node,
and the end position is in the node, then copy the appropriate
substring to a new leaf, and concatenate with result. Then
return result. - If the start position of this node is after the end of the desired
sub-rope, return result.
- The start position of this node is currentPosition;
An example implementation is below. It's a quick one that I whipped together in
Haskell. (I actually wrote versions in Haskell, OCaml, Python, and Java, but in
the end, I think that the Haskell is the easiest to follow.)
module Ropes where -- A rope is either an internal node with two children and no text, -- or a leaf node with text and no children. data Rope = InternalRopeNode Rope Rope | LeafRopeNode String deriving (Show) concatRope :: Rope -> Rope -> Rope concatRope (LeafRopeNode "") r = r concatRope r (LeafRopeNode "") = r concatRope r1 r2 = InternalRopeNode r1 r2 -- Subrope implemented using a fairly standard translation -- of iteration to recursion. The state of each iteration is -- represented by a triple consisting of the current position in -- the rope, the number of characters still to be copied, and -- the current subrope. subRopeIter :: Rope -> Int -> Int -> (Int, Int, Rope) -> (Int, Int, Rope) -- For an internal node, call subRope of the left child on the input -- state, and then take the state resulting from the left-child invocation, -- and use it as the input to the right child. There is a nice Haskell -- syntax for this, but I think the let is clearer for non-Haskell -- people. subRopeIter (InternalRopeNode left right) start len (pos, remaining, result) = let (leftPos, leftRemaining, leftResult) = subRopeIter left start len (pos, remaining, result) in subRopeIter right start len (leftPos, leftRemaining, leftResult) -- The real work happens on leaf nodes. We define it by cases: subRopeIter rope@(LeafRopeNode s) start len state@(pos, remaining, result) -- (1) node_start + len(s) < start: node content comes before the start point. -- Just increment the position past the node and continue. | remaining == 0 = state -- (2) node_start < start and node_start + len(s) > start: -- Copy substring, decrement remaining, increment position. -- (2a) node_start + len(s) > pos + remaining -- (2b) node_start + len(s) < pos + remaining | pos + length(s) < start = (pos + (length s), remaining, result) | pos < start && pos + (length s) > start = let startPosInNode = (start - pos) substr = (drop startPosInNode s) in if (length substr) < remaining then (pos + (length s), remaining - (length substr), (concatRope result LeafRopeNode substr))) else (pos + (length s), 0, (concatRope result (LeafRopeNode (take remaining substr)))) -- (3) node_start > start and cur_remaining > 0 and cur_remaining > len(s): -- Node lies entirely within the copy range. Copy entire node into -- result, decrement remaining, increment position. | pos > start && length(s) <= remaining = (pos + (length s), remaining - length(s), (concatRope result rope)) -- (4) node_start > start + len (aka remaining == 0): -- Increment pos. | pos > start && length(s) > remaining = (pos + length(s), 0, (concatRope result (LeafRopeNode (take remaining s)))) | otherwise = state subRope :: Rope -> Int -> Int -> Rope subRope r start len = let (_, _, result) = subRopeIter r start len (0, len, LeafRopeNode("")) in result
For example, we can define a non-trivial rope as follows:
let r = InternalRopeNode (InternalRopeNode (LeafRopeNode "abc") (InternalRopeNode (LeafRopeNode "def") (LeafRopeNode "ghi"))) (InternalRopeNode (LeafRopeNode "jkl") (LeafRopeNode "mno"))
That's a fairly complicated rope-tree of "abcdefghijklmno". In a real rope application, a string that short would really just be one node. But for illustration,
it'll do. To get the sub-rope of length 7 starting from position 5, we'd run:
subRope r 5 7
. When I do that in GHCI, I get the following:
InternalRopeNode (InternalRopeNode (LeafRopeNode "f") (LeafRopeNode "ghi")) (LeafRopeNode "jkl")
Or "fghijkl" - exactly right. And to do the substring operation, we needed
to traverse the rope tree, copy two pointers (to the "ghi" and "jkl" nodes),
create one new leaf node (for "f"), and create two new internal nodes for
the concatenations.
- Log in to post comments
Just use plain old arrays :)
Seriously, I had to write a specialized text editor few years ago. At first, I've tried to use rope-based strings.
However, when I later replaced ropes with plain old std::string - I got absolutely no visible slowdowns.
Of course, if you're going to edit 1Gb texts it's a completely different story.
I know this is off topic but I am a masters candidate in mathematics, research area is graph theory, and few fellow math students and I have just released a new podcast called Combinations and Permutations that I think that you and your audience may enjoy. Check it out in iTunes or Combinations and Permutations
I remember first encountering ropes when working on the ICFP 2007 contest (http://save-endo.cs.uu.nl/). The problem required fast manipulation of large strings, so it made a tremendous difference to get your data structures right.
Thanks, very interesting. Would you mind if i translate it to russian? :)
And by the way, what will be optimal leaf length in real applications? :) 10, 20, 50 symbols?
I've found that it's convenient for some operations to cache the length of the string in each node. Substring operations, for example, may not need to traverse the whole tree.
Oh, and the other thing, of course, is that in practice you need to do some balancing on the spine of the rope to avoid degenerate behaviour.
Re #5:
The optimal length of nodes in a rope is going to be somewhat dependent on the specifics of the application. In general, you want to try to balance the benefits of having discontinuous storage with the cost of having to traverse pointers. In some very artificial experiments on a 64-bit linux machine, it looks like it's in the 256 character range for edit operations including frequent inserts. I expected it to be a bit higher than that - that's really a lot of fragmentation.
Re #6:
Yeah, cacheing node length is an obvious optimization. It can shift the balance of the ideal node length, because it adds to the size of the node, but that's not a big deal.
On the other hand, looking around at various rope implementations on the net, some people have tried to cache position. That's a losing proposition - the cost of maintaining that as you move things around is not trivial, and it means that you can't do node sharing.
WordStar 6.0
Customize the function keys.
Ah, memories. We had some of these same concepts in Dystal, a language I implemented for a guy in the sociology department named Jim Sakoda. It was embedded in FORTRAN II. I was never clear on why this kind of processing was particularly important to sociologists.
I was once inspired by Pike's presentation of Plan 9 at an OS conference (Asilomar?) to go home and redesign the Secure Kernel I was implementing for MITRE/ARPA to follow its principles. Fortunately the project was canceled, as its philosophic contradictions might have caused CPUs to melt.
Mark:
Is there any chance you'd post the OCaml, Python and Java versions? It would be OK if they came with a disclaimer "Not fully tested" or something, but I think it'd make an interesting case study for comparison. I don't doubt that the Haskell version is the most readable -- this is the kind of task where Haskell shines.
-- Michael Chermside