checking string

How can I check that a string name has at least one character and contains no characters other than letters and hyphens?
To check whether it has at least one character you can use the string::size method.
To check whether it contains only specific kinds of characters you can loop through all the characters and check each one using cctype functions like isalpha,
or use the method find_first_not_of (see the sketch after the links):

http://www.cplusplus.com/reference/string/string/find_first_not_of
http://www.cplusplus.com/reference/clibrary/cctype
http://www.cplusplus.com/reference/string/string/size/index.html
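For example, a small sketch along those lines (the function name and the hard-coded letters-and-hyphens set are just an illustration of the question's requirement):

#include <iostream>
#include <string>

// Returns true if name is non-empty and contains only letters and hyphens.
bool valid_name(const std::string& name)
{
    const std::string allowed =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "-";
    return !name.empty() && name.find_first_not_of(allowed) == std::string::npos;
}

int main()
{
    std::cout << valid_name("Smith-Jones") << '\n';  // 1
    std::cout << valid_name("R2-D2") << '\n';        // 0 (contains digits)
    std::cout << valid_name("") << '\n';             // 0 (empty)
    return 0;
}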
For testing emptiness, using std::string::empty() is preferred to std::string::size() [as with all containers; the theory being that size() has to count ALL elements, whereas empty() can stop at 1].

Does the class "string" store the size of the string, or does it compute it on every call to .size() by scanning the string for the terminating null, the way C does?

If it computes the length, then perhaps (if you need only to check whether a string is non-empty, and nothing more) a faster method (especially if the string can be very long) is if(a[0]!=0) {...} (when a is the name of the string). Or more briefly: if(a[0]){...}.

As far as I know this operator does not verify whether the index is within the actual length of the string (while the .at method would throw an exception when trying to access an element that is not within the string).
a faster method (especially if the string can be very long) is if(a[0]!=0) {...} (when a is the name of the string). Or more briefly: if(a[0]){...}.
std::strings are not necessarily null-terminated
The size should be stored in an internal member, but I guess it may not be, depending on the implementation (though not storing it is quite unlikely).
DISCARD MY POST ABOVE: jsmith gave a better answer than I did. I started writing my post before his was submitted, and when I finished mine I did not refresh the page to see his.
Nonetheless, your question is a valid one, alfa333. std::string has to store the length in some way because of what Bazzy said. The length could be stored as a separate member variable, in which case .size() and .empty() are equally efficient. I would suspect, without looking, that if std::string works like std::vector -- ie, has a reserve() method -- .size() would do a subtraction (of two pointers) and return the result, whereas .empty() would do a == comparison.

Anyway, that's why I said "in theory". The point is that .empty() will never be less efficient than .size(), though quite probably it is no faster. Therefore .empty() would be preferred -- just in case.
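To illustrate that guess, a toy sketch only (not any real library's layout): size() as a pointer subtraction, empty() as a pointer comparison.

#include <cstddef>

class toy_string {
    char* first_;   // start of the characters
    char* last_;    // one past the last character
public:
    toy_string() : first_(0), last_(0) {}
    std::size_t size()  const { return static_cast<std::size_t>(last_ - first_); }
    bool        empty() const { return first_ == last_; }
};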

Thanks, jsmith. As described on http://www.cplusplus.com/reference/string/string/reserve/, there IS a string::reserve method, but it returns void, so can it help you? I am afraid that if you check the actual reserved memory, it can be larger than the actual string size (it can even be large when the string is empty) - it would not be reasonable to adjust the allocation every time the string shrinks (and may expand again in the near future).

I think this should be documented: how objects are represented and how their methods act. When optimizing, this would lead me either to choose another way of programming or another library (or, in desperation, to implement my own class).

Assume the length of the string is not stored "ready to read", but in a way similar to ASCIIZ. What if I need to test whether my string is at least n characters long? If I use .size() or .length(), it will need to scan the whole string - a waste of time. I could use .substr or .copy to copy the first n characters to another string and check whether the result really is n characters long or shorter. But if n is big, this is again a waste of time and memory. If I could simply scan n characters, it would be much faster than copying them and then scanning the copy to determine its length.

Okay, I could use .at (or .substr or .copy) to obtain just the n-th character from the string - if it doesn't exist, an exception will occur. Fast, but I need to learn exceptions... A difficult way of doing simple things.
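A small sketch of that idea (the helper name is made up; in practice s.size() >= n is simpler, since size() turns out to be constant time anyway):

#include <stdexcept>
#include <string>

// Hypothetical helper: true if s is at least n characters long.
// It probes the (n-1)-th character with .at(), which throws
// std::out_of_range when the string is shorter than n.
bool has_at_least(const std::string& s, std::string::size_type n)
{
    if (n == 0)
        return true;
    try {
        s.at(n - 1);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}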

Besides, if I would like to store the length of my string in a separate place, are there any functions that give (as a side effect) the length of a processed string? I mean, e.g. if I use .substr to copy n characters (fewer if there are not that many), then .substr knows how many characters were actually copied. But this information is forgotten, while it could be placed in some global variable instead.
I think this should be documented: how objects are represented and how their methods act. When optimizing, this would lead me either to choose another way of programming or another library (or, in desperation, to implement my own class).
The standard specifies how each modifier method acts, but it doesn't require any specific representation.

Assume the length of the string is not stored "ready to read", but in a way similar to ASCIIZ. What if I need to test whether my string is at least n characters long? If I use .size() or .length(), it will need to scan the whole string - a waste of time. I could use .substr or .copy to copy the first n characters to another string and check whether the result really is n characters long or shorter. But if n is big, this is again a waste of time and memory. If I could simply scan n characters, it would be much faster than copying them and then scanning the copy to determine its length.
(IMO) No implementation would implement size as linear time for strings

Besides, if I would like to store the length of my string in a separate place, are there any functions that give (as a side effect) the length of a processed string? I mean, e.g. if I use .substr to copy n characters (fewer if there are not that many), then .substr knows how many characters were actually copied. But this information is forgotten, while it could be placed in some global variable instead.
Having library functions which modify global variables (and make them available to the user) would be a sign of very bad design.
I think this should be documented: how objects are represented and how their methods act. When optimizing, this would lead me either to choose another way of programming or another library (or, in desperation, to implement my own class).

The standard documents runtime efficiency of various functions/algorithms, not internal details. It _should_ say that string::empty() is O(1). I don't know OTOH if it documents size() as O(1) or O(n). (std::string::size() and std::string::length() are the same.)

Besides, if I would like to store the length of my string in a separate place, are there any functions that give (as a side effect) the length of a processed string? I mean, e.g. if I use .substr to copy n characters (fewer if there are not that many), then .substr knows how many characters were actually copied. But this information is forgotten, while it could be placed in some global variable instead.

No, but you could easily accomplish this by creating your own iterator that wraps an output iterator and adds a count field which is incremented each time operator= is called (though such an iterator won't work with string's various methods, since they take indices, not iterators). After calling std::copy, you could then check the count in the output iterator instance to see how many elements it copied.
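A minimal sketch of that idea, assuming only std::copy and a wrapped output iterator (the type and member names are made up):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Output iterator that forwards every write to a wrapped iterator
// and counts how many assignments were made.
template <class OutIt>
struct counting_output_iterator
{
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef void difference_type;
    typedef void pointer;
    typedef void reference;

    OutIt out;
    std::size_t* count;   // shared counter, incremented on every write

    counting_output_iterator& operator*()  { return *this; }
    counting_output_iterator& operator++() { ++out; return *this; }
    counting_output_iterator  operator++(int)
    {
        counting_output_iterator tmp = *this;
        ++out;
        return tmp;
    }

    template <class T>
    counting_output_iterator& operator=(const T& value)
    {
        *out = value;   // forward the write
        ++*count;       // one more element copied
        return *this;
    }
};

int main()
{
    std::string src = "hello", dst;
    std::size_t copied = 0;
    counting_output_iterator<std::back_insert_iterator<std::string> >
        it = { std::back_inserter(dst), &copied };
    std::copy(src.begin(), src.end(), it);
    // copied == 5 here, without asking dst for its size
    return 0;
}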
Unfortunately nothing comes for free. I firmly believe that doing things the custom way is generally more effective, but less productive. Someone has to pay the piper. The blabber aside, the only container that has real reasons not to answer size() in constant time is list, because of its notorious splicing capability. All others can maintain their size in a variable. The specification will be fixed in C++0x according to the drafts:

size_type size() const;
1 Returns: a count of the number of char-like objects currently in the string.
2 Throws: nothing.
3 Complexity: constant time.

On a side note, I understand alfa333. Operations provided by libraries usually involve some inefficiency. It is unavoidable. This can sometimes be alleviated, of course, with some architectural modifications, but that is not the point. Abstraction is not free, but it is cheaper than the alternative.

Regards
You have calmed my nerves, which were upset by jsmith's post on Feb 20, 2011 at 7:44pm saying "the theory being that size() has to count ALL elements, whereas empty() can stop at 1", and by Bazzy (on Feb 20, 2011 at 9:30pm) suspecting that not all implementations store the size as a member.

About side effects, you are right: it would be better to return the extra info through an optional parameter rather than through a global variable. And none of this applies to the size of a string if it is stored as a member (well, back when I was using C, not C++, I wished such functions existed - and I remembered that yesterday).

About documentation: it is good that some of the info I wanted is available. When I need it, I will ask you how to find it (IMO, a link to the documentation should be given on http://www.cplusplus.com/reference/string/string/ and on each similar page).

But the complexity is not all that the programmer needs in order to compare libraries and to optimize the use even of a single library. If one method is 1000 times slower than another, they can still have the same complexity. So the coefficient in O( ) is important (although it is an open question how to ensure that two equally fast methods, tested on different computers, would be documented with the same coefficient).

Even if methods have different complexity, the definition of complexity refers to the behavior as the amount of data tends to infinity. We will never do computations on more than 10^80 data items (10^80 is an estimate of the number of all protons and neutrons in the known universe, and even that is a finite number).

And in many applications we do computations on quite small sets of data, but must optimize because many such sets must be processed (for example a simulation of a process described by 20 or 30 parameters). For such sets the complexity is worthless information. We tend to believe that polynomial complexity is always better than exponential complexity. But for 30 items of data, a complexity of 2^n, even with a big coefficient, can be better than a polynomial complexity like n^100 - which is still polynomial (2^30 is about 10^9, while 30^100 is about 10^148).

Hence I would like computer scientists to create a more reliable benchmark than complexity. And for now, I would like the library creators either
- to publish the source code, so that everyone could analyse the expected speed in a particular case (this can be done for libraries distributed for free);
- or to give a true estimating function for the number of instructions or something like that. Because it is not enough to know the highest term, even with its coefficient, if for my amount of data a lower-degree term with a bigger coefficient can give a larger result. This would perhaps apply only to paid libraries.

Greetings
@alfa333 - Sorry, I did not mean to upset you. We're actually in agreement on lots of what you say. Ever since I learned of big-O notation in college the cynical side of me has said exactly what you said -- the coefficient is important for exactly the reason you stated.

But I'm not sure what else library writers can do besides publish complexity. The problem with publishing source code is that you intend to customize your code around their implementation which means 1) if their algorithms change, your code changes too, to retain the same level of efficiency, and 2) you'll lose portability for the same reason as #1.

A true estimating function may be challenging for a similar reason: a function may take 100 instructions on one platform but 300 on another due to hardware-level constraints (number of registers available, instruction set). And timebase is obviously out too due to processor speed.
@alfa333
Theoretically, the library is currently free to do whatever it chooses, because the standard simply says:
size_type size() const;
Returns: a count of the number of char-like objects currently in the string.
There is no complexity guarantee, which means that the implementation can iterate the string seeking for the null character. Constant complexity guarantee may not tell you the exact performance measure, but it surely prohibits any such iteration.

That said, indeed no vendor in their right mind will make the method take linear time just to save 4 bytes. All containers can maintain a size field and update it incrementally after every operation with the exception of the list container, because there is no way to make the update after splicing.

Regarding asymptotic complexity - this is a deep subject and I love to talk about it, but I doubt that the forum is the right medium. Generally those things are a priority for academia.

Asymptotic analysis is not aimed at implementations. And algorithms transcend the boundaries of specific platforms to which implementations must lock themselves. Today you use some algorithm with 1M elements, tomorrow with 1G elements on some new platform, etc.

Since you can say that your implementation is based on some algorithm, you can also say that it is based on an algorithm with a specific asymptotic performance. Of course, software runs in finite memory and all computations are finite sequences. And finite sequences of computations are not subject to asymptotic analysis. But this is an extremist view. It is the same as arguing that I should never ask "are you free tomorrow at noon?", because the specific time may be important. Right, but you start somewhere and progress to finer detail if necessary.

Also, asymptotic analysis is just a tool. You are free to use it if it helps you. There is no point in arguing that it is not always helpful, because there is no magical wand like that.

Indeed, jsmith has properly explained the justifying properties of asymptotic results. The idea is that elementary operations cost differently on different platforms, and that the transition from algorithm to implementation is not invariant, so any choice of the implementer may result in slight alterations of performance. However, if two implementations are based on the same "recipe", they have something in common. And this is their scalability. Notice, that scalability has nothing to do with the processing of small quantities of data. Neither in science, nor in engineering.

Regarding the accurate prediction of performance. Science has already given you the tools for doing that - arithmetic. If you have the exact specification of your hardware and the exact output from your compiler, then using simple arithmetic you can compute the worst execution time of your solution as a function of the input size. Tough, you say. That's why it's not so useful. It would be useful if you could actually do it. There is research into performing this kind of analysis automatically. In fact, there is research into embedding this type of analysis in the compiler and performing optimizations accordingly. But it is a complex subject, beyond my current understanding.

You can always analyze the exact type of operations that your algorithm uses and count them. Algorithmic analysis frequently does that. Like, how many writes and reads, how many additions and multiplications, etc. It will not give you the execution time, but it is easier to do and is much more information than simply an asymptotic bound.
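For instance, one crude way to count operations empirically (here, the comparisons a sort performs) is to wrap the predicate; the names below are made up for illustration:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Comparator that counts how many times it is invoked.
struct counting_less {
    std::size_t* count;
    bool operator()(int a, int b) const { ++*count; return a < b; }
};

int main()
{
    std::vector<int> v;
    for (int i = 0; i < 1000; ++i)
        v.push_back((i * 37) % 1000);          // some scrambled data

    std::size_t comparisons = 0;
    counting_less pred = { &comparisons };
    std::sort(v.begin(), v.end(), pred);       // copies of pred share the counter

    std::printf("%lu comparisons to sort 1000 elements\n",
                static_cast<unsigned long>(comparisons));
    return 0;
}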

And last, but not least. All analysis is designed with pragmatism in mind. Not unscrupulous pragmatism, but pragmatism nonetheless. For example, comparison of numbers actually has logarithmic time complexity, but we frequently say that it takes constant amount of time. In the context of sorting algorithms for example, no one considers this aspect in relation to the comparison of indices. This is benevolent intentional sloppiness. The argument is, that if you have the guts to run huge volumes of data through a routine, you also have the means to provide hardware that compares the larger quantities involved at least as fast as the old hardware that you used in the previous age, when your problems were smaller.

Regards
All right, alfa333 is right after all - size() is supposed to have constant complexity for all containers, which includes strings. Even in the current standard there is a guarantee of that:

Standard wrote:

expression      complexity
...
a.size()        (Note A)
...
Notes: the algorithms swap(), equal() and lexicographical_compare() are defined in
clause 25. Those entries marked "(Note A)" should have constant complexity.

The last guarantee is broken for list in practice. It is a recognized fact, and apparently there are different opinions on how this should be fixed. Some say that the standard should change. Others think that the implementation of splice should change. Apparently a controversial topic.

Regards
Thanks, simeonz. Your posts explain most of my doubts and show problems with my ideas.

For libraries: what about publishing a chart of how the computation time of a particular function depends on the size of the data? Such a chart could easily be obtained by a sequence of tests done by the library creator (even automatically!). This would tell more than the complexity. The axis scale need not be linear if another scale is more reasonable.

The only problem is inter-platform comparison: what is a single processor instruction on one platform may be several instructions on another. This is especially a problem if part of the code is written in assembler. What is written in C++ can be analysed with good accuracy on the basis of a (not precisely defined) notion of an elementary C++ operation.

There could also be a "standard C++ platform benchmark": a program that performs e.g. 1000 additions, 500 assignments, 700 comparisons, 100 function calls etc. We measure the running time of this program and consider it the measure of the speed of the platform (including the processor speed and type, the OS, the version of the compiler etc). The running time of this program would be a good unit in which to express the running time of other programs, and this "scale" would be almost platform independent.
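A crude sketch of what such a benchmark program might look like (the operation counts and the mix are made up for illustration, and clock() has coarse resolution):

#include <cstdio>
#include <ctime>

int main()
{
    const long repetitions = 100000L;
    volatile long a = 0;        // volatile: keep the work from being optimized away
    volatile long sink = 0;

    std::clock_t start = std::clock();
    for (long r = 0; r < repetitions; ++r) {
        a = 0;
        for (int i = 0; i < 1000; ++i) a = a + i;        // ~1000 additions and assignments
        for (int i = 0; i < 700; ++i) if (a > i) ++sink; // ~700 comparisons
    }
    std::clock_t stop = std::clock();

    double seconds = double(stop - start) / CLOCKS_PER_SEC;
    std::printf("one platform unit = %f seconds\n", seconds);
    return 0;
}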

And you are right that I have led the discussion into problems that do not belong on this forum, and especially not in this thread. Where should I post ideas for library creators?

I also have some ideas for compiler creators (like telling the compiler "I expect 'warning 237' at line 1423 - don't show this warning for this line"). Where should I post them?

Thanks in advance.
Generally, options for locally disabling and enabling warnings, as well as declaring warnings as errors (for a piece of code only), exist in most programming tools, including compilers. For C++ compilers, the method of choice is to use pragma directives. See gcc for example:
http://gcc.gnu.org/onlinedocs/gcc/Diagnostic-Pragmas.html

Now, this is not handled in the standard, but warnings are compiler-specific, and to define them or the related facilities would be very restrictive, not to mention breed controversy. That's what pragmas are for - compiler-specific extensions that alter the compilation of a specific piece of code.
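For illustration, roughly this (only a sketch of the syntax from the page above; which warnings can be controlled, and where the pragmas may appear, depends on the GCC release):

// Ignore one specific warning from this point on, then turn it back into a warning.
#pragma GCC diagnostic ignored "-Wunused-variable"

void demo()
{
    int unused_local = 0;   // no -Wunused-variable reported while the pragma is active
}

#pragma GCC diagnostic warning "-Wunused-variable"

int main() { return 0; }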

Using elementary operations for benchmarking (or some other abstract measure) will provide some platform independence. But the problem is that the results still remain very implementation dependent. They will not stay the same from one code iteration of the library to the next. Why? Because the implementation may change, for good reasons. It is true that this method will provide more information when dealing with smaller volumes of data and for ranking routines. But it is just a non-binding benchmark, not a promise to comply with a performance guarantee (such as asymptotic analysis in relation to scalability for large inputs).

Here are some problems that you have to consider:

- The optimization of the code may erase some operations. For example, instead of using addition, some compilers use array indexing. As strange as it sounds, you can use the machine instructions for the latter to perform the former. Arithmetic that involves constants can frequently be simplified to shifting and bit-wise operations, which are orders of magnitude faster. Hardware may be able to perform SIMD operations; some hardware will be able to do that with a specific type of data, and other hardware will not. First you have to consider these aspects when designing the abstraction, trying not to make it too bloated. And second, when you interpret the benchmark results you also need to take this into account.

- The library routines can be revised. This is normal, because the implementers may come up with better solutions for a particular platform, as I explained in the last paragraph. And since you already have code based on the previous estimates, how do you react to the change? Apparently no worse than with a simple asymptotic performance guarantee, but there is simply a lack of commitment here. And it is the commitment to a certain specification that influences your design decisions.

- If you truly want your code to be portable, you have to consider the common denominator of the target platforms. But the relative cost of operations is different on, say, RISC and CISC processors. So, again, the additional information will not hurt you, but it will not provide you with a platform-independent answer either.

I'm not arguing against it per se. I'm just saying that some problems cannot be solved this way. Benchmarks of performance can always be provided, even by third parties, and why not, but they will have to be revised frequently.

I've mentioned automatic analysis of execution time to you. If I understand it correctly, it is applied to a specific platform and specific code. So, if you have a hard time constraint that you have to meet, like the time allowed for a response to a query on a network, you can directly analyze the worst possible duration of your routine and (with some tentativeness and additional tests) be assured that it won't exceed the maximum.

Where should you ask library developers? I am not sure about that. To be honest, I'm a bit new to the web thing in terms of open communication. You can always address specific vendors. If you want to broadcast your discussion, I guess you should probably check some of the newsgroups - comp.lang or something like that. The process seems bulkier and I have never had the dire need, but you can try it. For a discussion of asymptotic analysis, you would best address some lecturer or someone directly involved. I mean, the Internet is a lousy medium for detailed science discussions. Again, if you want to attract bandwidth, I guess you can google for computer science forums and take it from there.

Regards

EDIT:
Just thought of a good example of how operations get subsumed after compilation. Suppose we have this code:
    volatile int x = 0;
    volatile int y = 1;
    int a = x;
    int b = y;
    int c = a / b;
    int d = a % b;
    return c + d;

Division and remainder are both expensive operations, and they are frequently used in pairs. So, sometimes when I want to optimize a remainder that follows a division, I am tempted to do this (hoping that a multiplication and a subtraction will together be cheaper):
    volatile int x = 0;
    volatile int y = 1;
    int a = x;
    int b = y;
    int c = a / b;
    int d = a - b * c;
    return c + d;

When compiling with g++ -O2 -S -fverbose-asm -masm=intel, the original code yields:
	mov	DWORD PTR [esp+12], 0	 # x,
	mov	DWORD PTR [esp+8], 1	 # y,
	mov	eax, DWORD PTR [esp+12]	 # a, x
	mov	ecx, DWORD PTR [esp+8]	 # b, y
	cdq
	idiv	ecx	 # b
	lea	eax, [edx+eax]	 # tmp64,

and the modified (supposedly optimized) code yields this:

	mov	DWORD PTR [esp+12], 0	 # x,
	mov	DWORD PTR [esp+8], 1	 # y,
	mov	ebx, DWORD PTR [esp+12]	 # a, x
	mov	ecx, DWORD PTR [esp+8]	 # b, y
	mov	eax, ebx	 # tmp65, a
	cdq
	idiv	ecx	 # b
	lea	ebx, [eax+ebx]	 # tmp67,
	imul	ecx, eax	 # tmp69, tmp65
	sub	ebx, ecx	 # tmp67, tmp69
	mov	eax, ebx	 #, tmp67

Now, you may notice that in the binary of the original code, there is no remainder instruction. There is just one idiv. That is because the remainder is byproduct from the division. The quotient can be found in eax and the remainder is in edx. In the modified code, instead of using the remainder that is already computed, I use a bunch of additional operations, and not only that, but utilize additional registers that must be saved on the stack and restored after. All this is sad business for my optimization, but shows how the mapping between source and binary is highly non-trivial.

Also, you can see here, how instructions for address arithmetic (lea) are used in place of integer addition. This is done for both the original and the modified version.

Actually, when you think about it, the remainder is indeed an auxiliary result that you compute at each step when you divide two numbers, so the design of the instruction is not surprising at all. But that does not mean that computing a remainder is cheap in general. It is cheap only when paired with a preceding division.

And something else that occurred to me that I should mention. The standard does not specify the type and count of operations that an algorithm employs on fundamental (built-in) types. For the user types that are involved, the operations usually are specified. Since they are harder to optimize (if at all), and can actually involve some heavy processing, they are documented. For example, in the specification of the list modifiers:
Complexity: Insertion of a single element into a list takes constant time and exactly one call to the
copy constructor of T. Insertion of multiple elements into a list is linear in the number of elements
inserted, and the number of calls to the copy constructor of T is exactly equal to the number of
elements inserted.
Thanks, simeonz. I've browsed http://gcc.gnu.org/onlinedocs/gcc/Diagnostic-Pragmas.html . As far as I could understand it, the warning control is not specific enough:

I would like to be able to suppress a single type of warning (e.g. when I really do mean if(i=n)..., I want to suppress "suspicious assignment" but keep the other warnings possible in the same region of code).

And it would be best if inserting such pragmas could be automated (this depends on the IDE, not on the compiler itself): when I have the list of warnings generated by a compilation, I should be able to right-click one and choose "This issue is OK, don't tell me about it here anymore", and the editor should surround the corresponding line of code with the disabling and enabling pragmas. Of course it would be good to be able to mark several warnings and disable them with one right-click.

Preferably, a piece of code shorter than a whole line could be the scope of a disabled warning. Some Pascal compilers allowed controlling options from specially formatted comments: when $ was the first character within a comment, the comment was an option switch; in C++, /* */ comments could similarly switch options for very short pieces of code. And the editor could make them invisible, if the user wanted.

The problem of portability does not apply: the need to disable warnings matters during debugging, when you recompile the code again and again and cannot read hundreds of warnings. If I could say, once per warning, that a particular one is unimportant, I would sacrifice the time to do it, so that on subsequent compilations I would receive a few important warnings rather than hundreds, most of them unimportant.

For moving the code to another version of the compiler, it should be possible to automatically remove all such "directives" (or optionally only some of them - how to automate selecting them is open to discussion).
--
Regarding the benchmarks and charts: I knew that this would be only an approximate approach. Whether it is reasonable depends on the accuracy that can be achieved. And on the truthfulness of the authors of the chart (this would rely either on their honesty or on verifiability).
But they could give at least preliminary info, in an easier way than writing testing modules of my own, each suited to a different library or even just a different function in the library.
Of course, upon every library improvement the charts should be updated. If the library comes with compiled modules (.obj), this also applies to recompilation with an improved compiler. This would not require anybody to rebuild already created programs (not even those in the middle of being created), but it would matter when creating new ones.

And the O( ) declarations are the same when one algorithm is slower than the other by a constant coefficient. Also, if one algorithm works in O(n log n) time while another works in O(n sqrt(log n)) time, the latter seems better. However, without coefficients you cannot say which one is faster for as many as 1K elements, because sqrt(log 1K) is only about 3.2 (assuming base 2) - this can easily be less than the quotient of the coefficients.
You may rely on the O( ) difference when your n is so big that you believe the quotient of the declared asymptotes (without coefficients) is larger than the quotient of the coefficients. When comparing n^m with n^k, this is usually reasonable.
--
You wrote about div and mod. I encountered this problem some years ago (in C) and I wondered why there was no function (written in assembler) that performs both operations and returns, e.g., the quotient as the value and the remainder through a reference parameter. Or a "void" function could return them both through parameters. Or a struct containing two integers could be returned as the value of the function (though I guess filling the struct and reading its members would be slower than using parameters).
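(In fact, the C standard library provides exactly this: div() from <stdlib.h> - std::div in <cstdlib> - returns both results in one struct. A minimal sketch:)

#include <cstdio>
#include <cstdlib>   // std::div, std::div_t

int main()
{
    // One call computes both the quotient and the remainder
    // and returns them together in a struct.
    std::div_t qr = std::div(17, 5);
    std::printf("quotient = %d, remainder = %d\n", qr.quot, qr.rem);
    return 0;
}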
As I understand your post, an optimizing compiler can notice that I am doing an integer division and a "mod" in consecutive instructions and compute them both with one processor instruction, idiv. That's good, but is this described anywhere that a usual user reads when starting to use a compiler? When optimizing, I would like to know which operations are twice as fast as they appear. Although a factor of two makes no difference to the big-O complexity, I still prefer an algorithm that is twice as fast.
Although a factor of two makes no difference to the big-O complexity, I still prefer an algorithm that is twice as fast.

I doubt anyone would argue against that. Here is what hypothetically could happen though. First, the vendors could publish detailed statistics. Second, the community or some organization could publish detailed statistics. Third, the programmers could perform detailed tests when they need them.

The first option is up to the vendors themselves. Describing a non-binding characteristic in the specification is a bit against contemporary methodology, because it actually makes your implementation opaque. There is a chance that this will backfire, because the software would be designed with very specific performance expectations in mind. If tomorrow the vendor decides to change the nuts and bolts, that would be felt much more extremely by its clients.

The community option is just difficult to maintain. I mean, someone would have to do the work for all the vendors. If you include embedded computing, then you have a humongous range of possibilities. Who would just donate the resources for the cause? It is possible, but I dunno.

The third option is the most feasible. Knowing the financing method of projects today however, I would think that going into details will be the last resort, after doing what-have-you fails.

---

The compiler optimization strategy is difficult to convey. First, there is nothing that the "usual user" can comprehend about those things, me included. Second, even if I had the appropriate background, I still wouldn't read 1000 pages of manual describing how a particular compiler operates. I just know some things that I consider sufficiently common and important, and this gives me an advantage in rare cases. But unless something applies rather universally, would you really want to go through hundreds of pages of compiler analysis techniques only to reach knowledge that may or may not be useful?

I mean, the whole idea of programming in C++, and not C, and not assembly, is to abstract above those things. If you wanted to know exactly what contributes to your execution time, then you should have used macro assembler instead. The code may not be as portable, but the entire purpose is to get some edge from micro-optimizations, right? If you want to have complete control over the performance of some routine, you should provide your own implementation. Do you really want me to implement something for you, only for you to spend all the time in the world on gray-box analysis of my implementation choices? This is not interface-based programming, to say the least.

For me, the whole point of not doing something yourself is not having to check how it is done. A relative of mine once said: "If I wanted to explain to them exactly how to do it, I would have done it myself." The entire product is the information. If I acquire all the information, then what do I need the product for? This is also a major problem with some software enterprises. By reuse they mean taking an undocumented piece of code that someone in the company wrote long ago and reverse engineering the heck out of it to understand how to use it. All you need are answers to the strategic questions. Once you get into details, you are not reusing anything anymore. For example, I don't ask whether my STL uses a skip list or an RB-tree. If I really prefer one over the other, I should implement it myself.

---

You want more intelligent IDEs. But let me start with something general and then I'll get to the warnings problem.

The C++ toolchain is, I think, flawed and retrograde by today's standards. Every intelligent IDE contains a C++ parser, a static analysis tool must contain a parser too, introspection utilities have semi-intelligent parsers, the compiler has a parser, and the documentation extraction tool (if you use one) contains yet another parser. Besides the obvious duplication of effort, there are other problems with this, like consistency. If one of these tools evolves independently (to accommodate the new standard, to implement an optional feature, etc.), there is no guarantee that the tools up the chain will remain compatible, not to mention interoperable. This explains why people still use configurable text editors for development. It is just less hassle.

Another problem is that, since the format is plain, you must have unnecessarily complex syntax analysis to extract additional semantics from the text. I can argue that plain text is not suitable for storing user documents. All the scope resolution and collision-evasion rules in the language exist because the document is virtually unstructured, there is no metadata, and the meaning depends on the context. IDEs map the point of reference to the definition/declaration for browsing, refactoring, etc. This information is not inherently supported in the source file, and it is either re-acquired every time or saved in auxiliary databases (which leads to potential versioning problems). The compiler performs the same duty all over again. The IDE cannot load and host the parser in its process (and perform spell checks with it). The parsing done by the IDE cannot be used by the compiler. Also, if I want some special pre-processing for the language, the IDE will not recognize the new syntactic constructs.

My point is that instead of having reusable modules loaded into the tools (IDE, compiler, meta-compiler, etc.), plug-ins that extend those modules with custom syntax, and a structured format that supports unambiguous queries, we pipe the tools together with text as the communication and storage medium. There are some projects that try to fix this, like LLVM, but I think they plan to work within the confines of the compilation model, which (if true) is limiting. Also, some commercial solutions have tried to use databases as the permanent storage medium for source files, but there is no promise of financial return for the investors, and all of this goes more or less silently under the radar.

Regarding the warnings. Indeed, it appears that disabling individual warnings is not supported in GCC at the moment. I have used a static analysis tool that employs your strategy, but I cannot tell you what MS uses. The argument of the GCC team is that warnings should be fixed, because the programmer is forced into a more responsible attitude. Of course, that assumes the warnings can be fixed. This is not always true. There are a few warnings that are simply attention grabbers, with no workaround. You can see the relevant discussion here:
http://gcc.gnu.org/ml/gcc/2000-06/msg00638.html

You can use comments for saving information. Doxygen uses them to store documentation, version control systems use them, and some IDEs (like Emacs) use them for storing configuration options. First, if you decide to migrate from one tool to the next, these special comments become ordinary text. Second, folding works only if the IDE knows what to fold. Interleaving different varieties of information in a single source is IMO messy and highly non-interoperable. Compare this to using a standardized, extensible format that allows annotations - for example a database(-like) format (even an XML database), so that the relations between the objects in the code can be captured.

---

I am sorry that I rambled so much. I for one understand, that there are many ideas out there, but only few of them will see the light of day. (I can hardly do anything before I read some more and acquire solid skills.)

Regards


Thanks, simeonz. You have convinced me with many things. Still:
warnings should be fixed, because the programmer is forced into a more responsible attitude

Well, this sometimes forces you to write less optimal code, as in the given example (if(i=n)..., with a single "=").
Also, a warning should be generated when the body of a loop is empty, even though an empty body is sometimes intended.

Regarding migration, I suggested being able to automatically remove all comments that switch options. They are not needed anymore when moving the code to another platform, which is usually done once the initial version of the program works properly, right?
But I can give up on options within comments. Just let #pragma deal with each type of warning separately - this is (probably) what they talked about on
http://gcc.gnu.org/ml/gcc/2000-06/msg00638.html
which you indicated (and the linked pages); they even suggested including the name of a variable detected as uninitialized.

And this is probably what is described on
http://msdn.microsoft.com/en-us/library/2c8f766e%28v=vs.80%29.aspx
and on http://www.dr-bill.net/CSC076/class_summaries/3-26/pragmas.htm
- I did not find it earlier. I wish this were available in other compilers (for example gcc or Dev-C++).

And I would also like to be able to semi-automatically surround a line of code with #pragma disable and #pragma enable (or something with push/pop).
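Something like this sketch using the Visual C++ pragmas from the MSDN page above (C4706 is its "assignment within conditional expression" warning):

#include <cstdio>

int main()
{
    int i = 0, n = 5;

#pragma warning(push)
#pragma warning(disable : 4706)   // assignment within conditional expression
    if (i = n) {                  // intentional assignment; only C4706 is silenced,
        std::printf("%d\n", i);   // and only between the push and the pop
    }
#pragma warning(pop)

    return 0;
}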