Strings C++

TheIdeasMan wrote:
but isn't `R` in the first code snippet an xvalue?

Why yes, yes it is.

Would one need to return the type Set& instead?

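Nope. You cannot return a reference to an expiring local. But that's the beauty of returning a local by value: the compiler can either construct R directly in the caller's storage (NRVO, the named-variable form of RVO, Return Value Optimization) or, failing that, treat R as an rvalue and move it to the caller. Either way, no explicit move() is needed.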
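
A minimal sketch of the return-by-value case, not the code from earlier in the thread. The `Set` struct, the `make_union` function and the `copies`/`moves` counters are hypothetical stand-ins, there only to make it visible that no copy happens:

```cpp
#include <set>
#include <utility>

// Illustration-only counters (assumed names, not from the thread).
int copies = 0;
int moves = 0;

// Hypothetical stand-in for the thread's Set type; it counts copies and moves.
struct Set {
    std::set<int> items;
    Set() = default;
    Set(const Set& other) : items(other.items) { ++copies; }
    Set(Set&& other) noexcept : items(std::move(other.items)) { ++moves; }
};

// Returns a local by value. The compiler may elide the copy entirely (NRVO);
// if it does not, R is treated as an rvalue and moved, never copied.
Set make_union(const Set& a, const Set& b) {
    Set R;
    R.items.insert(a.items.begin(), a.items.end());
    R.items.insert(b.items.begin(), b.items.end());
    return R;  // no std::move and no Set& return type needed
}
```

Calling `Set u = make_union(a, b);` should leave `copies` at 0. Whether `moves` ends up 0 or 1 depends on whether the compiler applies NRVO, which it is allowed but not required to do.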

Or make it an out parameter?

Meh, you could. Out parameters are ugly, IMO.

RE: words with spaces in them

There are no words with spaces in them. A space is, by definition, a word separator.

There are, of course, compound word phrases, which may or may not have spaces, but they are composed of multiple words.

A compound word, in contrast, has either no separator or uses a hyphen.
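
If you take that definition at face value, splitting is just stream extraction. A small sketch (the function name `split_on_whitespace` is my own, nothing standard): `operator>>` skips runs of whitespace, so hyphenated compounds survive as one token while open compounds come apart.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text on runs of whitespace; hyphens stay inside a token.
std::vector<std::string> split_on_whitespace(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

With this, "ice-box" stays one token, while "ice box" becomes two.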

againtry (2313)

There are no words with spaces in them

workspaces?

Duthomhas (13310)

JLBorges (13770)

In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter), although this concept has limits ... Many English compound nouns are variably written (for example, ice box = ice-box = icebox; pig sty = pig-sty = pigsty) with a corresponding variation in whether speakers think of them as noun phrases or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic.
...
However, the equivalent to the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese, Japanese, where sentences but not words are delimited, Thai and Lao, where phrases and sentences but not words are delimited, and Vietnamese, where syllables but not words are delimited.

In some writing systems however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.

https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation

againtry (2313)

Aside from all that there are five points:
1. If a space, or whatever character(s), is in fact a valid part of the English alphabet then so be it. Just add it to the array/vector/set of your choice.
2. @Duthomhas is pointing out the current norm in our language, that ice box for example is a noun phrase. Besides, hyphenation is also covered elsewhere here. Simply add the hyphen to the alphabet for those inclined.
3. Tigrinyan people are renowned for being completely unintelligible to outsiders. I doubt whether they would use anything other than an ASCII keyboard. Thai doesn't use a 26-letter alphabet at all, and neither does Lao; both are based on Khmer. Chinese characters are logograms, numbering in the thousands for a basically literate Chinese reader, and aren't an alphabet at all.
4. Let's hope we don't lose the spirit of @OP's request and go down the rabbit hole even further by mentioning umlauts, accents and cedillas (ae?)
5. Meanwhile, I wonder what the Cyrillic, Egyptian and even cuneiform script writers have to add with their hieroglyphic attempts at coping with compounds.
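
Points 1 and 2 can be sketched roughly like this, assuming the alphabet is held in a `std::set<char>`. The names `split_by_alphabet` and `make_alphabet` are made up for illustration: any character in the alphabet is part of a word, and anything else is a separator.

```cpp
#include <set>
#include <string>
#include <vector>

// Collects maximal runs of characters that belong to `alphabet`;
// every other character acts as a separator.
std::vector<std::string> split_by_alphabet(const std::string& text,
                                           const std::set<char>& alphabet) {
    std::vector<std::string> words;
    std::string current;
    for (char c : text) {
        if (alphabet.count(c) != 0) {
            current += c;
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}

// Hypothetical helper: lowercase ASCII letters plus any extra characters.
std::set<char> make_alphabet(const std::string& extra) {
    std::set<char> alphabet;
    for (char c = 'a'; c <= 'z'; ++c) {
        alphabet.insert(c);
    }
    alphabet.insert(extra.begin(), extra.end());
    return alphabet;
}
```

Passing `make_alphabet("-")` keeps "ice-box" as a single word, while `make_alphabet("")` splits it into "ice" and "box". This only handles single-byte characters, so it says nothing about the non-ASCII scripts discussed above.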

JLBorges (13770)

When different languages are involved, in most cases ICU's boundary analysis is quite handy.
https://unicode-org.github.io/icu/userguide/boundaryanalysis/

For word boundaries,

Word boundaries are identified according to the rules in https://www.unicode.org/reports/tr29/#Word_Boundaries , supplemented by a word dictionary for text in Chinese, Japanese, Thai or Khmer. The rules used for locating word breaks take into account the alphabets and conventions used by different languages.

With the caveat:

ICU’s break iterators are based on the default boundary rules described in the Unicode Standard Annexes 14 and 29. These are relatively simple boundary rules that can be implemented efficiently, and are sufficient for many purposes and languages. However, some languages and applications will require a more sophisticated linguistic analysis of the text in order to find boundaries with good accuracy. Such an analysis is not directly available from ICU at this time.

TheIdeasMan (6856)

@Duthomhas

Ah, I keep forgetting about RVO.

What I was thinking of was the return type being a const ref: that was supposed to extend the lifetime, but maybe that is gone since C++17 with the mandatory elision.
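
For reference, a sketch of where lifetime extension does apply, as far as I understand it: it happens at the call site, when a temporary returned by value is bound to a const reference. That rule is still there in C++17. The function names `make_greeting` and `greeting_length` are made up for the example.

```cpp
#include <cstddef>
#include <string>

// Returns by value; the result is a prvalue at the call site.
std::string make_greeting() {
    std::string local = "hello";
    return local;
}

// Binding the returned temporary to a const reference extends its lifetime
// to that of the reference, so `ref` stays valid until the function ends.
std::size_t greeting_length() {
    const std::string& ref = make_greeting();
    return ref.size();
}

// By contrast, a function declared as `const std::string& f()` that returns
// a local would hand back a dangling reference; no extension applies there.
```

So a const ref return type does not rescue a local, but a const ref variable on the caller's side does keep a by-value result alive.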
