And by small, I mean < 1-2 GB.
I can fully confirm @jonnin's point here.
A simple text search of 1 GB can take less than 1/5th of a second on most hardware this side of Sandy Bridge, if the text is already in RAM.
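If you want to put a rough number on that for your own machine, a sketch like the following will do. Nothing here is tuned or editor-specific; the 1 GB dummy buffer and the "needle" string are just placeholders, and it's only a plain linear scan with std::string_view::find:

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>

// Count occurrences of `needle` with a plain linear scan.
std::size_t count_matches(std::string_view text, std::string_view needle)
{
    std::size_t count = 0;
    for (std::size_t pos = text.find(needle);
         pos != std::string_view::npos;
         pos = text.find(needle, pos + 1))
        ++count;
    return count;
}

int main()
{
    // ~1 GB of dummy text already resident in RAM, with one match near the end.
    std::string text(1'000'000'000, 'a');
    text.replace(text.size() - 6, 6, "needle");

    auto start = std::chrono::steady_clock::now();
    std::size_t hits = count_matches(text, "needle");
    auto stop = std::chrono::steady_clock::now();

    std::cout << hits << " match(es) in "
              << std::chrono::duration<double>(stop - start).count() << " s\n";
}
```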
The problem gets nasty when you think about certain applications. Acrobat Pro, a PDF editor, is (or was for a long while) just C++ code, so performance is fairly good.
Something like MS-Word, or even the text editor of Visual Studio, brings another dimension to the problem. I haven't bothered with MS-Word "programming" in a long while, but it once offered VB scripting, which meant the document text had to be made available to the VB memory system, and that complicated performance matters (as @jonnin points out, it's a pig).
Visual Studio offers "add-on" components built in .NET (typically C#), which means THAT text is made available to the .NET system, and is thus not stored in the more efficient fashion normally associated with C++ high performance programming.
Another thought about the actual issues designing such editors is....editing.
We might reasonably assume that most text editors simply load the entire text into memory for editing, as the text is typically quite small (say, source code) relative to memory sizes.
However, text from an XML file could be huge, and a text editor may be required to view/edit it.
This complicates matters when support for files larger than RAM is required. There are many approaches, but among them is using the file itself as the initial source and building an array of start/end offsets marking the line breaks (or page breaks and other boundaries in formatted text, as in word processors), so that only the local view must be loaded into RAM (the part that fits the display, with a margin above and below to support scrolling). One good way of handling THAT design is to use memory mapped files, which make the file look like RAM, 'windowing' through the file as required.
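For what it's worth, here's a minimal sketch of that idea using POSIX mmap (on Windows the equivalents are CreateFileMapping/MapViewOfFile). The line-offset index and the hard-coded "window" of lines 100-119 are purely illustrative; a real editor would likely build the index lazily or in the background:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <iostream>
#include <string_view>
#include <vector>

int main(int argc, char* argv[])
{
    if (argc < 2) { std::cerr << "usage: view <file>\n"; return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);
    std::size_t size = static_cast<std::size_t>(st.st_size);

    // The kernel pages the file in on demand; nothing is copied up front.
    const char* data = static_cast<const char*>(
        mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Index of where every line begins: one sequential pass, storing only offsets.
    std::vector<std::size_t> line_start{0};
    for (std::size_t i = 0; i < size; ++i)
        if (data[i] == '\n' && i + 1 < size)
            line_start.push_back(i + 1);

    // "Window" into the file: show lines 100..119 without touching the rest.
    for (std::size_t ln = 100; ln < 120 && ln < line_start.size(); ++ln) {
        std::size_t begin = line_start[ln];
        std::size_t end = (ln + 1 < line_start.size()) ? line_start[ln + 1] : size;
        std::cout << std::string_view(data + begin, end - begin);
    }

    munmap(const_cast<char*>(data), size);
    close(fd);
}
```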
Then... the edit point must be considered. That's where the cursor points and the user types something. This naturally divides the source text into "before" and "after" the cursor point, along with whatever new text is substituted for the original. This can be supported with a small structure which basically looks like a splice: all the text before the point of editing, then the text that was edited (or inserted), followed by the rest of the data after that position.
Naturally, though, as the user continues editing (bouncing around the file with cursor movements), this implies a tree of these "edit point" structures, creating a map which shows how the resulting file would be produced: take unedited source material from the original, substitute new or altered text (omitting deleted text), splice that to the next unedited block from the source, and sweep through all the changes made during the editing session.
Searching would have to be able to run through those mapped edit points appropriately, as would saving the edited file.
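To make that splice idea concrete, here's a bare-bones sketch along the lines of a piece table, which is one common way editors implement exactly this: a read-only original buffer, an append-only "add" buffer, and an ordered list of pieces describing which spans, in which order, make up the current document. "Saving" is just sweeping the pieces in order, and a naive search can run straight across piece boundaries. The names and structure are illustrative only, with no deletion support or error handling:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// One contiguous span of text, taken either from the read-only original
// buffer or from the append-only "add" buffer.
struct Piece {
    bool from_add;       // false = original buffer, true = add buffer
    std::size_t offset;  // where the span starts in its buffer
    std::size_t length;  // how long the span is
};

struct PieceTable {
    std::string original;       // loaded once, never modified
    std::string add;            // every insertion is appended here
    std::vector<Piece> pieces;  // ordered "splice map" of the document

    explicit PieceTable(std::string text)
        : original(std::move(text)), pieces{{false, 0, original.size()}} {}

    // Insert `text` at document position `pos` by splitting the piece that
    // covers `pos` into before / inserted / after.
    void insert(std::size_t pos, const std::string& text)
    {
        std::size_t add_off = add.size();
        add += text;

        std::size_t doc_pos = 0;
        for (std::size_t i = 0; i < pieces.size(); ++i) {
            Piece p = pieces[i];
            if (pos <= doc_pos + p.length) {
                std::size_t split = pos - doc_pos;
                std::vector<Piece> repl;
                if (split > 0)
                    repl.push_back({p.from_add, p.offset, split});
                repl.push_back({true, add_off, text.size()});
                if (split < p.length)
                    repl.push_back({p.from_add, p.offset + split, p.length - split});
                pieces.erase(pieces.begin() + static_cast<std::ptrdiff_t>(i));
                pieces.insert(pieces.begin() + static_cast<std::ptrdiff_t>(i),
                              repl.begin(), repl.end());
                return;
            }
            doc_pos += p.length;
        }
    }

    std::size_t size() const
    {
        std::size_t n = 0;
        for (const Piece& p : pieces) n += p.length;
        return n;
    }

    // Character at document position `pos`, found by walking the pieces.
    char at(std::size_t pos) const
    {
        for (const Piece& p : pieces) {
            if (pos < p.length)
                return (p.from_add ? add : original)[p.offset + pos];
            pos -= p.length;
        }
        return '\0';
    }

    // Naive search that runs straight across piece boundaries.
    std::size_t find(const std::string& needle) const
    {
        for (std::size_t i = 0; i + needle.size() <= size(); ++i) {
            std::size_t j = 0;
            while (j < needle.size() && at(i + j) == needle[j]) ++j;
            if (j == needle.size()) return i;
        }
        return std::string::npos;
    }

    // "Saving" is just sweeping the pieces in order and copying each span out.
    std::string materialize() const
    {
        std::string out;
        for (const Piece& p : pieces)
            out += (p.from_add ? add : original).substr(p.offset, p.length);
        return out;
    }
};

int main()
{
    PieceTable doc("Hello world");
    doc.insert(5, ", brave new");
    std::cout << doc.materialize() << '\n';      // Hello, brave new world
    std::cout << doc.find("new world") << '\n';  // 13
}
```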