Last week, I promised to continue my discussion of ropes. I'm going to
break that promise. But it's in a good cause.
If you're a decent engineer, one of the basic principles you should
follow is keeping this as simple as possible. One of the essential skills
for keeping things simple is realizing when you're making something
overly complicated - even if it's cool, and fun, and interesting.
Working out the rope code for more complex operations, I found myself
thinking that it really seemed complex for what I was doing. What I'm
doing is writing myself a text editor. I'm not writing a string
processing framework for managing gigabytes of data; I'm just writing a
text editor.
So I decided to try, for the purposes of testing, to look at one of
the simpler data structures. A classic structure for implementing editor
buffers is called a gap buffer. I'll explain the details below.
The drawback of a gap buffer is that large cursor motions are relatively
expensive - the cost of moving the cursor is proportional to the distance
that you move it.
So I wrote a test. I implemented a basic gap buffer. Then I built a
test that did 300,000 inserts of five characters; then slid the cursor
from the end, to the beginning, back to the end, back to the beginning,
and back to the end again. Total execution time? 150 milliseconds. And
that includes resizing the buffer 12 times, because of a deliberately
stupid buffer sizing and copying strategy.
And with that, out the window went the ropes. Because in Java,
using a slapped together, sloppy piece of test code, which does things in a dumb
brute force way, it can do something much more complicated than anything an editor
will encounter in routine use, the gap buffer achieves performance that is more than
adequately fast. (Based on some experiments I did back at IBM, delays of 1/10th
of a second are roughly when people start to notice that an editor is slow. If you can respond is less than 1/10th of a second, people don't perceive a troublesome delay.)
If the gap buffer can achieve the needed perfomance that easily, then there's
absolutely no reason to do anything more complicated.
Ok. So, what's a gap buffer?
It's a fascinatingly simple data structure - exactly the kind of
thing that appeals to my sense of software aesthetics. It's a great big
buffer - an array of characters. It's got two segments of text - one at
the beginning of the buffer, which grows upwards; and one at the end of
the buffer, which grows downwards. The space between those two segments
is called the gap. The gap is the cursor in the
buffer. To move the cursor forwards, you just move the gap: move one character from the
beginning of the end segment to the end of the beginning segment. To
insert a character at the cursor, you add a character to the beginning
segment. To delete the character before the cursor, you just subtract one
from the length of the beginning segment.
For example, below is a diagram of a gap buffer. In the first version, it contains the text "Hello there readers", with the cursor
after the "r" in "readers".
The next version
moves the cursor backwards one step. So we copy one character from
the end of the pre-gap region to the beginning of the post-gap region,
and then adjust the pointers. We don't even bother to erase the buffer
position within the gap that used to contain the "r" - we just move
the pre-gap pointer.
In the third version, we insert the string "my" - just write the characters "m" and "y" into the next positions in the pre-gap region,
and advance the pre pointer. The resulting buffer contains the text
"Hello there myreaders".
That's all there is to it. A gap buffer is just a character array, with two pointers/indices to track the position of the beginning and ending segments. It's incredibly simple - and yet, it's also
quite surprising. I certainly wouldn't have thought of it on my own!
For its simplicity, it still manages to achieve outstanding performance on all of the most common actions in an edit buffer:
- Local cursor motions are extremely fast; it's always exactly
O(1) cost for a single character cursor move, O(n) for arbitrary
cursor motions. - Character inserts are really fast. They've got an amortized
average cost of O(1). (The reason for that amortized cost is
because you sometimes need to expand the buffer, which
requires a lot of copying.) - Character deletes are super fast - just increment or decrement
a pointer - O(1). - Search/scan is reasonable. Stream oriented operations like
search have an absolute linear-time lower bound, which is
obviously achieved by the gap buffer. And as important,
implementing stream-oriented operations is extremely
easy to do correctly on a gap buffer.
Finally, I'll close with some code. This is a basic gap buffer
implementation in Scala, my new favorite programming language. (I'm
long been a huge fan of OCaml. Scala is very much like OCaml adapted
for the JVM, with a better syntax.)
// Copyright 2009, Mark C. Chu-Carroll package com.scienceblogs.goodmath.apex.buffer class GapBuffer(size : Int) { var _buf : Array[Char] = new Array[Char](size) var _size = size var _presize = 0; var _postsize = 0; /** * Move the cursor one character forward. */ def stepCursorForward() = { if (_postsize > 0) { _buf(_presize) = _buf(_size - _postsize) _presize = _presize + 1 _postsize = _postsize + 1 } } /** * Move the cursor one character backward. */ def stepCursorBackward() = { if (_presize > 0) { val c = _buf(_presize - 1) _buf(_size - _postsize - 1) = c _postsize = _postsize + 1 _presize = _presize - 1 } } /** * Move the cursor. * @param distance the distance to move the cursor. */ def moveCursor(distance : Int) = { if (distance > 0) { for (i <- 0 until distance) { stepCursorForward() } } else { val d = - distance for (i <- 0 until d) { stepCursorBackward() } } } /** * Move the cursor to a specific position. * @param loc the desired cursor position. */ def moveCursorTo(loc : Int) = { moveCursor(loc - _presize) } /** * Insert a character immediately after the cursor position. * @param c the character to insert. */ def insertChar(c : Char) = { if (_presize + _postsize == _size) { expand() } _buf(_presize) = c _presize = _presize + 1 } /** * Insert a string after the cursor position, and advance * the cursor to the position after the inserted string. * @param s the string to insert */ def insertString(s : String) = { for (i <- 0 until s.size) { insertChar(s.charAt(i)) } } /** * Double the size of the character storage buffer. */ def expand() = { val newsize = _size * 2 var newbuf = new Array[Char](newsize) for (i <- 0 until _presize) { newbuf(i) = _buf(i) } for (i <- 0 until _postsize) { newbuf(newsize - i - 1) = _buf(_size - i - 1) } _buf = newbuf _size = newsize } }
- Log in to post comments
Implement line length, and you have fast WYSIWYG for the VT100
Re #1:
In fact, I've actually got an implementation working which includes line/column tracking,
cut and paste, and undo. But I thought that that added too much noise - the basic concept of the gap buffer is so beautifully simple, and I wanted to try to really show that. This dinkly little fragment of code does most of the work for an editor back-end!
Interesting
I was just talking to my wife about sorting and the TSP, a couple of topics computer scientist professors like to use while teaching. She has a background in math and computer science. I told her that the difference between a young computer scientist and an experienced engineer is the computer scientist will spend man-weeks figuring out how to find an optimal solution, and an engineer will spend a couple of hours estimating and implementing a solution that is good (fast-, close-enough-to-optimal, etc.) enough. E.g., bubble sort is perfectly fine for sorting small lists.
To be fair, even young engineers will spin their wheels making things more complicated than necessary for the task at hand. And to be fair to younger engineers, many engineers, young and old, let resume envy drive solutions to new problems (e.g., when the next programming language comes out, many engineers see every problem as a nail custom made for the new hammer they want so badly to use.) It is enough of a problem that I always try to ask, when a new toolset is being proposed, why, what's wrong with the tools we already have?, whether it is me or others proposing a new tool. I am being paid to solve problems, not pad my resume, after all.
Good, simple solution. Obviously a multi-GHz processor has made this a good viable solution again.
WordStar 6.0 for DOS with user-customized Function and Ctrl-Function keys. Unformatted mode. Add mouse utilities if you are feeling gnarly. It will handle 20 MB files no sweat in modern iron... and you get column-block operations.
WordStar sort utilities are slow, especially column block sorts. Use Robert Pirko's RPSORT (102) on text files.
Uh, Uncle Al:
MarkCC uses a Mac. And I think he just prefers to roll his own. 8^)
But, if he uses Boot Camp...
Re KeithB, UncleAl:
Actually, I'd rather use something that's already been written. I just haven't been able to find what I want.
Acme convinced me that you could build a full-fledged programming environment without the fancy GUI bells and whistles that dominate things these days. All of the fancy things that are so common in typical development tools can be done in Acme - only it does it without wasting space on fancy UI gadgetry, without embedding its own complete programming language, and without forcing me to waste my time dragging through endless menus. It made a programming tool out of something simple but powerful, and made it effective and extendible in a great way.
But the way it's designed, it makes very heavy use of the mouse. In Acme, you can't even move the cursor up one line without using the mouse. You can't start a search without the mouse. Nothing. With my RSI trouble, I can't rely on the mouse that much. It hurts to go back and forth between keyboard and mouse.
I haven't been able to find anything that's built on a basic design like Acme, but doesn't rely on the mouse for *everything*. If I could find something that did what I wanted, or even came *close* to doing what I wanted, I'd drop this code like a hot rock. I don't *want* to build an editor. I just don't know how else to get the kind of tool that I want.
If you know of anything that sounds like what I've talked about, *please* let me know. I'd really love to find something.
(I've looked at the obvious things: Emacs, Vim, and their relatives. Eclipse, Netbeans, JBuilder. JEdit. aoeui/asdfg. J. Eddie. Of course Sam and Acme and Wily.)
MCC, regarding RSI, I got over mine by using a mouse left handed at work and right-hand a thumb-operated track ball at home. I'd switch hands too, but I haven't found a left handed track ball yet.
I like codewrite.
Re #8:
Believe me, I've tried just about everything for my RSI.
My main problem isn't the mouse. It's not that using the mouse will make it worse, but just the constant arm motion hurts because of the damage that's already done.
At work, I use a Kinesis Countoured keyboard, which is absolutely wonderful. I can use that for hours on end without any problem; it lets me take all of the weight off my wrists, and to type at speed with minimal finger extension. So when I'm on that keyboard, I'm good. But when I need to take my hands off of it, that's the problem. Doing it once in a while is fine. But doing it the way you need to for Acme or Sam? Every single time I need to move my cursor? No.
Re #9:
Defunct commercial development tools for Windows are not exactly what I'm looking for...
I don't mind paying for software, but as it happens, I need something that I can use on my Mac, my linux workstation at work, and my linux laptop at home. So even if codewright were still available, it wouldn't fit the bill.
Does anyone make a foot-operated mouse? I can imagine Mark's desk transforming into some crazed phantom-of-the-Opera organ console :)
I am not sure why Borland discontinued CodeWrite, but probably because there wasn't much left to improve, and the trend for tool providers to package text editing/project management as part of an IDE. Even without an RSI, I loathe being a slave to the mouse.
Foot-operated mouse: what's the market for that? It wouldn't be hard for me (or many others) to design if there were a market. Voice controlled mouse is another option, but might not be useful if you work in a cubicle environment. Keyboard alt and esc sequences, cscope (or similar) and emacs seem to be more efficient solutions, especially if you program your own macros--but I haven't used them in years.
I'm looking froward to brainwave controlled user interfaces, myself, or some as yet unimagined alternative to mice and keyboards.
Mark:
maybe you should just write an operating system hack that allows keyboard control over the mouse. I know the Amiga used to have that...
I discovered gap arrays a while back in Sather (a sadly underused language).
Since then I've used them in a couple of classes (in "C") as an assignment - but with a wrinkle. Start with the basic gap array, then add more gaps, which mean that the bookkeeping increases - in the simplest way linearly with the number of gaps. Set up a test harness that does random (but correlated in position) inserts and deletes and see how well they perform as the number of gaps increases.
At some point I realized that you could arrange the gaps in a tree structure and so searching the array of gaps becomes logarithmic. When I read your post it occurred to me that the tree-structured multi-gap array (TSMGA) looks rather like the ropes you started with.
Buffer gaps don't need multi-GHz CPUs, they were invented before multi-MHz CPUs. The canonical description was in Craig Finseth's 1980 Master's thesis, and the technique was used in TECO.
For details, see Finseth's "The Craft of Text Editing": http://www.finseth.com/craft/
Re #16:
Gap buffers passed through a time when they weren't the best solution.
Originally, gap buffers were used for what we would consider tiny files. The average file size for files in the original TECO was something under 16K!
At some point in time, when file sizes had gotten much larger, but processors weren't so fast, it made sense to try to use some kind of more clever structure, like Ropes - because they allowed you to avoid doing large copies for large cursor motion.
Now we're back at the point where the copying makes sense again. But I think it's not really because of the raw speed of processors, but because of cache effects. Caches have gotten so large that you'll typically have the entire buffer in cache. With the pointer games in structures like ropes, you're much more likely to wind up with cache misses.
I use FAR Manager (http://www.farmanager.com/index.php?l=en) for most of my programming in dynamic languages (IDE doesn't make a lot of sense for them and REPL is good enough for debugging). FAR is absolutely powerful tool, it doesn't rely on mouse AT ALL (though it supports it, of course).
Actually, FAR is the only application which still holds me on Windows. Alas, it's Windows-only :(
But the only reason that this looks so simple is that the JVM does all of the tricksy work of making virtual memory look like a big array of char. All the "ropes" (or whatever) and block buffering and whatnot is still there - it's just that application programmers these days don't have to write them.
re #17
- you know, the gap buffer could still work ... with a tree of gap buffers. That is: the leaf nodes are gap buffers. The non-leaf nodes *work* like gap buffers with respect to insertion and deletion of leaf nodes: they have an array of N pointers to nodes, with a gap. With java, you could even use weak and strong references to automatically cache unused leaf notes to tempfiles.
Re #19:
That's really not true.
For a 16 megabyte file - which is far larger than anything I expect to encounter in real use in an editor buffer - there's no virtual memory work going on to maintain the buffer. It's completely in a single contiguous memory segment. There's nothing fancy going on, in either the JVM or the OS virtual memory system. It really is this straightforward.
On the other hand, given modern perfomance characteristics, the "tree of gap buffers" is likely to perform very poorly - it takes what is conceptually a contiguous sequence of characters, and breaks it into pieces, which are scattered around in memory. The standard algorithms used by the processor cache for prefetching fail in that layout, and you wind up paying a very significant performance penalty as a result of cache misses.
It's a symptom of something really interesting: many of us who make our living as software engineers grew up in a world where performance tradeoffs were very different than they are today, and our instincts are often totally wrong. When I was in school, working as a sysadmin, a 20 megabyte disk drive was large. Now my cellphone has 8 gigabytes of storage! The machine under my desk has 8 gigabytes of memory. The cache on my new laptop is larger than the disk capacity of my college computer. It's strange.
With a bit more book-keeping you could lazily move the gap only when there's an insert. Ie if I do down/down/down/left 17 times, then type something, you could wait to move the gap until I try to insert something at that point.
Gap buffers have been around for ages. I believe that Expensive Typewriter used them on the old PDP 1-X with its Friden Flexowriters. It gibed well with paper tape oriented editors like TECO. Early TECOs didn't have a read file command, but rather a read another chunk, write another chunk primitive, so for larger files the user was responsible for paging.
The original EMACS was built on top of TECO which used gap buffers, and I implemented at least three quick baby EMACS using the technology. If nothing else, it was quick and easy.
Finseth's MINCE used a variant technology because he paged off of a floppy disk. It was sort of a BIBOP (big bag of pages) gap buffer scheme with the text buffer spanning pages, but each page having its own gap. (It is possible that later TECOs used a page buffer related scheme too).
I'm pleased that such a simple technology is returning to favor. I remember when terminal updates were expensive, and folks designed all sorts of complicated algorithms to minimize update times using hash codes, hinting algorithms and so on. I was lazy, so I just maintained a model of the screen and did a trivial scan for differences. To my amazement it worked well. I remember having long discussions about editors for the early IBM PCs. How could we optimize screen updates? It turned out to be trivial, just update from scratch on each keystroke. The user would never notice.
You may want to take a look at Evergreen. It's an Acme-like editor written in Java. Sounds like it might be what you're looking for.
http://software.jessies.org/evergreen/
Sounds like what would be called a 'zipper' in functional programming - the beauty being that the two segments can be stored in cons lists headed at the current position. Check out Gerard Huet's 'Functional Pearl' (OK, he did it for trees), and papers that generalize to differentiating data structures:
http://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/hue…
http://strictlypositive.org/diff.pdf
http://strictlypositive.org/dfordata.pdf
Mik