Data structure

What's the best data structure for designing a text editor? For example, how do PDF readers and MS Word find a sentence or a word, and all of its occurrences, so quickly? Obviously they wouldn't be using KMP or something like that.
Why do you think something like KMP won't work?

There is a data structure called a rope which helps with insertion/deletion.
https://en.wikipedia.org/wiki/Rope_(data_structure)

Alternatively, there is the gap buffer:
https://en.wikipedia.org/wiki/Gap_buffer
There are plenty of open source editors.
Grab some code and study actual implementations.
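
To make the gap-buffer idea concrete, here is a minimal, illustrative sketch in C++ (not production code): the text lives in one array with a movable "gap" at the cursor, so typing and backspacing at the cursor are cheap, and only cursor movement pays to shift characters.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimal gap-buffer sketch. [gapStart, gapEnd) is the unused gap,
// kept at the cursor so inserts/deletes there are O(1) amortized.
class GapBuffer {
    std::vector<char> buf;
    std::size_t gapStart = 0, gapEnd = 0;
public:
    explicit GapBuffer(std::size_t cap = 64) : buf(cap), gapEnd(cap) {}

    void moveCursor(std::size_t pos) {        // slide the gap to pos
        while (gapStart > pos) buf[--gapEnd] = buf[--gapStart];
        while (gapStart < pos) buf[gapStart++] = buf[gapEnd++];
    }
    void insert(char c) {                     // type at the cursor
        if (gapStart == gapEnd) grow();
        buf[gapStart++] = c;
    }
    void erase() { if (gapStart > 0) --gapStart; }   // backspace

    std::string str() const {                 // text without the gap
        std::string s(buf.begin(), buf.begin() + gapStart);
        s.append(buf.begin() + gapEnd, buf.end());
        return s;
    }
private:
    void grow() {                             // double capacity, keep tail at end
        std::vector<char> nb(buf.size() * 2);
        std::copy(buf.begin(), buf.begin() + gapStart, nb.begin());
        std::size_t tail = buf.size() - gapEnd;
        std::copy(buf.begin() + gapEnd, buf.end(), nb.end() - tail);
        gapEnd = nb.size() - tail;
        buf.swap(nb);
    }
};
```

A rope trades this layout for a balanced tree of string fragments, which handles edits far from each other better; the gap buffer wins when edits cluster around one cursor, which is the common case in an editor.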

Text editor or word processor? It matters (markup inside vs. plain text).
What is the goal? That is, what trade-offs will you make to be great at one thing at the cost of another?

Finding duplicate text in small files is trivial for a modern CPU, even by brute force. And by small, I mean < 1-2 GB. If you want to be hot stuff on multi-GB files, you need to be a bit smarter. (Word is dumb as a post on performance, by the way; it chokes on big files.)

Displaying the text as graphics likely costs as much performance as the actual work, if not more. A good console program that just does the work (like grep: find all lines containing "blah") blows them away.
And by small, I mean < 1-2 gb.


I can fully confirm @jonnin's point here.

A simple text search of 1 Gbyte can take less than 1/5th of a second on most hardware this side of Sandy Bridge, if the text is already in RAM.
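
As a rough illustration of why that speed is plausible, here is a brute-force "find all occurrences" over in-memory text. This is an illustrative sketch, not how Word or Acrobat actually do it; the point is that `std::string_view::find` typically compiles down to a vectorized memchr-style scan, so even a plain loop chews through gigabytes quickly.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Return the starting offsets of every occurrence of needle in text.
std::vector<std::size_t> findAll(std::string_view text, std::string_view needle)
{
    std::vector<std::size_t> hits;
    for (std::size_t pos = text.find(needle);
         pos != std::string_view::npos;
         pos = text.find(needle, pos + 1))      // resume one past last hit
        hits.push_back(pos);
    return hits;
}
```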

The problem gets nasty when you think about certain applications. Acrobat Pro, a PDF editor, is (or was for a long while) just C++ code, so performance is fairly good.

Something like MS-Word, or even the text editor of Visual Studio, brings another dimension to the problem. I haven't bothered with MS-Word "programming" in a long while, but it once offered VB scripting, which meant the document text had to be available to the VB memory system, and that complicated performance (as @jonnin points out, it's a pig).

Visual Studio offers "add-on" components built in .NET (typically C#), which means THAT text is made available to the .NET system, and is thus not stored in the more efficient fashion normally associated with C++ high performance programming.

Another thought about the actual issues designing such editors is....editing.

We would easily assume that most text editors should expect to load the entire text into memory for editing, as the text is typically quite small (say, source code) relative to memory sizes.

However, text from an XML file could be huge, and a text editor may be required to view/edit it.

This complicates matters when support for files larger than RAM is required. There are many approaches. One is to use the file itself as the initial source, building an array of start/end points marking the line breaks (or page breaks, or other boundaries in formatted text like word processors), so that only the local view must be loaded into RAM: whatever fits the display, plus a margin above and below to support scrolling. One good way of handling THAT design is to use memory-mapped files, which make the file look like RAM, "windowing" through the file as required.

Then...the edit point must be considered. That's where the cursor sits and the user types something. This naturally divides the source text into "before" and "after" the cursor, plus whatever new text replaces the original. This can be supported with a small structure that basically looks like a splice: all the text before the point of editing, then the newly typed (or substituted) text, followed by the rest of the data after that position.

Naturally, though, as the user continues editing (bouncing around the file with cursor movements), this implies a tree of these "edit point" structures: a map showing how the resulting file would be produced by taking unedited source material from the original, splicing in new, altered, or deleted text, then continuing with the next unedited block from the source, sweeping through all the changes made during the editing session.
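
The "map of edit points" described above is essentially what's known as a piece table. A minimal sketch (illustrative only; a real editor would use a balanced tree of pieces rather than a flat vector): the document is a sequence of "pieces", each referencing either the original read-only source or an append-only buffer of inserted text, and edits only split or add pieces.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Piece-table sketch: the original text is never moved or modified;
// insertions go into an append-only "add" buffer, and the document is
// described by a list of (buffer, start, length) pieces.
class PieceTable {
    struct Piece { bool fromAdd; std::size_t start, len; };
    std::string original, add;
    std::vector<Piece> pieces;
public:
    explicit PieceTable(std::string orig) : original(std::move(orig)) {
        if (!original.empty())
            pieces.push_back({false, 0, original.size()});
    }

    void insert(std::size_t pos, const std::string& text) {
        Piece np{true, add.size(), text.size()};
        add += text;
        std::size_t off = 0;
        for (std::size_t i = 0; i < pieces.size(); ++i) {
            if (pos == off) {                        // insert before piece i
                pieces.insert(pieces.begin() + i, np);
                return;
            }
            if (pos < off + pieces[i].len) {         // split piece i
                Piece tail = pieces[i];
                std::size_t cut = pos - off;
                tail.start += cut; tail.len -= cut;
                pieces[i].len = cut;
                pieces.insert(pieces.begin() + i + 1, {np, tail});
                return;
            }
            off += pieces[i].len;
        }
        pieces.push_back(np);                        // append at the end
    }

    std::string str() const {                        // splice out the result
        std::string out;
        for (const Piece& p : pieces)
            out += (p.fromAdd ? add : original).substr(p.start, p.len);
        return out;
    }
};
```

Saving walks the piece list in order, exactly the "sweep through the changes" described above; searching likewise has to run across piece boundaries.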

Searching would have to be able to run through those mapped edit points appropriately, as would saving the edited file.

You are very thorough, Niccolo! I enjoy your posts.


