Writing my own String-Split function

Sep 26, 2011 at 6:52am
closed account (GzwXoG1T)
Hello everyone,

I have been programming in C++ for about six months now. I was helping a friend with a project when I came across this strange error. My split works as intended on the first two splits only.

Code:
int findAll(string str, char delim)
{
	int size = 0, temp = 0, pos = 0;
	while ( (temp = str.find(delim, pos) ) > -1)
	{
		pos = 1 + temp; size++;
	}
	return size;
}

string* split(string str, char delim, int& outSize)
{ /* TODO: free this memory when finished!!! */
	outSize = findAll(str, delim);
	if (outSize == 0)
		return NULL;
	string* out = new string[outSize];
	int start = 0, find = 0;
	for (int index = 0; index < outSize; index++)
	{
		find = str.find(delim, start);
		if (find < 0)
			out[index] = str.substr(start);
		else
			out[index] = str.substr(start, find);
		start = find + 1;
	}
	return out;
} /* delete [] varName */


Call:
	int size;
	string* myArr = split("HE,LL,O!", ',', size);
	for (int index = 0; index < size; index++)
		cout << myArr[index];
	delete [] myArr;


Any help would be appreciated! Thanks for your time!

Full project with error: http://gyazo.com/0a66deb05b18e78ec7958ab5026c6801.png
Last edited on Sep 26, 2011 at 6:57am
Sep 26, 2011 at 2:23pm
Enter this after line 13 (the call to findAll at the top of split):

std::cout << "outSize=" << outSize << "\n";

It won't give you the result you're expecting.
There are only two strings in your array because your findAll function finds only two commas/delimiters. It ignores anything that comes after that final comma.
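To make the off-by-one concrete, here is a minimal sketch (not the poster's code): countDelims and countPieces are hypothetical names, and the only assumption is that N delimiters always separate N + 1 pieces, including empty ones.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): counts how many times
// delim occurs in str.
std::size_t countDelims(const std::string& str, char delim)
{
	std::size_t count = 0;
	std::size_t pos = str.find(delim);
	while (pos != std::string::npos)
	{
		++count;
		pos = str.find(delim, pos + 1); // continue just past the last hit
	}
	return count;
}

// N delimiters separate N + 1 pieces, so the array needs one more slot.
std::size_t countPieces(const std::string& str, char delim)
{
	return countDelims(str, delim) + 1;
}

// Quick check with the string from the question.
void checkCounts()
{
	assert(countDelims("HE,LL,O!", ',') == 2);
	assert(countPieces("HE,LL,O!", ',') == 3);
}
```

Sizing the array with something like countPieces instead of the raw delimiter count leaves room for the text after the final comma.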
Regards, keineahnung
Sep 26, 2011 at 10:09pm
closed account (GzwXoG1T)
Thanks, I completely forgot about checking the outSize while debugging. Good catch.

Another error I found was that the delimiter was ending up in the output. The second argument of substr is a length, not an end position, so I fixed it by subtracting the start position from the found position.
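A small sketch of that substr detail, in case it helps anyone else. The helper name between is made up for illustration; std::string::substr itself takes (position, length).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the characters in [start, end) of str.
// std::string::substr takes (position, length), so the length is end - start.
std::string between(const std::string& str, std::size_t start, std::size_t end)
{
	assert(start <= end && end <= str.size());
	return str.substr(start, end - start);
}

void checkSubstr()
{
	const std::string s = "HE,LL,O!";
	// Passing the end index as the length reads on past the second comma:
	assert(s.substr(3, 5) == "LL,O!");
	// Passing end - start yields just the piece between the commas:
	assert(between(s, 3, 5) == "LL");
}
```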

#include <iostream>
#include <string>

using namespace std;

int findAll(string str, char delim)
{
	int size = 0, temp = 0, pos = 0;
	while ( (temp = str.find(delim, pos) ) > -1)
	{
		pos = 1 + temp; size++;
	}
	return size + 1;
}

string* split(string str, char delim, int& outSize)
{ /* TODO: free this memory when finished!!! */
	outSize = findAll(str, delim);
	if (outSize == 0)
		return NULL;
	string* out = new string[outSize];
	for (int index = 0, start = 0, find = 0; index < outSize; index++)
	{
		find = str.find(delim, start);
		if (find < 0 || find < start)
		{
			out[index] = str.substr(start);
			break;
		}
		else
			out[index] = str.substr(start, find - start);
		start = find + 1;
	}
	return out;
} /* delete [] varName */

int main()
{
	int size;
	string* myArr = split("HE,LL,O!", ',', size);
	for (int index = 0; index < size; index++)
		cout << myArr[index];
	delete [] myArr;
	cin.get();
}


Thank you!
Sep 26, 2011 at 10:23pm
I have my own C split function where you can use a string as the delimiter:

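#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char ** split(const char * text, const char * separator, int * count){
	unsigned int sep_pos = 0, i = 0, j = 0, k = 0, ccount = 1;
	char ** list = (char**)malloc(sizeof(char*));
	char *buffer;
	char * ntext;
	bool flag = true;

	if(strlen(separator) > strlen(text)){
		/* the separator cannot occur: return the whole text as the only token */
		ntext = (char*)malloc(strlen(text)+1);
		memcpy(ntext, text, strlen(text)+1);
		list[0] = ntext;
		if(count != NULL)
			(*count) = 1;
		return list;
	}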

	for(i=(strlen(text)-strlen(separator)); i<=strlen(text)-1; i++){
		if(text[i] != separator[j])
			flag = false;
		j++;
	}

	if(flag == false){
		ntext = (char*)malloc(strlen(text)+strlen(separator)+1);
		memset(ntext, 0, strlen(text)+strlen(separator)+1);
		strcat(ntext, text);
		strcat(ntext, separator);
	} else {
		ntext = (char*)malloc(strlen(text)+1);
		strncpy(ntext, text, strlen(text)+1);
	}

	for(i = 0; i <= (strlen(ntext)-strlen(separator)); i++){
		flag = true;
		k = 0;

		for(j=i; j<=(i+strlen(separator)-1); j++){
			if(ntext[j] != separator[k])
				flag = false;
			k++;
		}

		if(flag == true){
			list = (char**)realloc(list, (ccount*sizeof(char*)));
			int size_buffer = (i-sep_pos);
			buffer = (char*)malloc(size_buffer+1);
			int k = 0;

			for(j=sep_pos; j<=(sep_pos+size_buffer-1); j++){
				buffer[k] = ntext[j];
				k++;
			}

			buffer[k] = '\0';

			list[ccount-1] = buffer;

			sep_pos = i+strlen(separator);

			ccount++;
		}
	}

	free(ntext);

	if(count != NULL)
		(*count) = ccount-1;

	return list;
}


Example of use:

int main(){
	int count;
	char ** list = split("This[separator]is[separator]a[separator]test", "[separator]", &count);
	for(int i = 0; i<=count-1;i++){
		printf("%s\n", list[i]);
		free(list[i]); /* each token was malloc'ed separately by split */
	}

	free(list);

	return 0;
}
Sep 27, 2011 at 6:29am
closed account (GzwXoG1T)
@nadarST, that is pretty impressive. I will have to add a string delimiter overload into my function ASAP.

Edit, not as complicated in C++:
#include <iostream>
#include <string>

using std::string;


int charCount(string str, string delim)
{
	int size = 0, temp = 0, pos = 0;
	while ( (temp = str.find(delim, pos) ) > -1)
	{
		/* step past the delimiter only, so back-to-back delimiters are all counted */
		pos = temp + (delim.empty() ? 1 : delim.length()); size++;
	}
	return size + 1;
}

string* split(string str, string delim, int& outSize)
{ /* TODO: free this memory when finished!!! */
	outSize = charCount(str, delim);
	if (outSize == 0)
		return NULL;
	string* out = new string[outSize];
	for (int index = 0, start = 0, find = 0; index < outSize; index++)
	{
		find = str.find(delim, start);
		if (find < 0 || find < start)
		{
			out[index] = str.substr(start);
			break;
		}
		else
			out[index] = str.substr(start, find - start);
		start = find + delim.length();
	}
	return out;
} /* delete [] varName */

string* split(string str, char delim, int& outSize)
{
	return split(str, string(1, delim), outSize);
}

int main()
{
	int size;
	string* test = split("a[]b[]c[]d[]e[]f", "[]", size);
	for (int index(0); index < size; index++)
		std::cout << test[index] << std::endl;
	delete [] test;
	return 0;
}


:D
Last edited on Sep 27, 2011 at 6:57am
Sep 28, 2011 at 2:45am
Playing with string tokenization is something I enjoy. I've written about a billion different versions of it...

Here's something I really like that I wrote some time ago. It extracts all non-empty tokens:
//----------------------------------------------------------------------------
#include <algorithm>
#include <functional>

template <typename StringType, typename Predicate, typename OutputIterator>
OutputIterator
split(
  const StringType& s,
  Predicate         p,
  OutputIterator    i
  ) {
  typedef typename StringType::const_iterator iter;
  iter rtok;
  iter ltok = std::find_if( s.begin(), s.end(), std::not1( p ) );
  while (ltok != s.end())
    {
    rtok = std::find_if( ltok, s.end(), p );
    *i++ = StringType( ltok, rtok );
    ltok = std::find_if( rtok, s.end(), std::not1( p ) );
    }
  return i;
  }

//----------------------------------------------------------------------------
#include <string>

struct contains: std::unary_function <char, bool>
  {
  const std::string& s;
  contains( const std::string& s ): s( s ) { }
  bool operator () ( char c ) const
    {
    return s.find( c ) != std::string::npos;
    }
  };

//----------------------------------------------------------------------------
#include <iostream>
#include <iterator>  // back_inserter
#include <vector>

int main()
  {
  using namespace std;
  cout << "ssplit() tester. Press ^Z to end.\n";

  while (true)
    {
    string s_to_split;
    cout << "\nEnter the string to split: ";
    getline( cin, s_to_split );
    if (!cin) break;

    string s_delimiters;
    cout << "Enter the characters to split on: ";
    getline( cin, s_delimiters );

    vector <string> vs;
    split( s_to_split, contains( s_delimiters ), back_inserter( vs ) );

    for (size_t n = 0; n < vs.size(); n++)
      cout << "\"" << vs[ n ] << "\"\n";
    cout << vs.size() << " results\n";
    }

  cout << "Bye.\n";

  return 0;
  }

It is often useful to have empty tokens produced as well... the function is minimally modified:

//----------------------------------------------------------------------------
#include <algorithm>
#include <functional>

template <typename StringType, typename Predicate, typename OutputIterator>
OutputIterator split(
  const StringType& s,
  Predicate         p,
  OutputIterator    i,
  bool empties_ok = true )
  {
  typedef typename StringType::const_iterator iter_type;
  iter_type rtok = s.end();  // initialized so the final check is safe when s is empty
  iter_type ltok = empties_ok
                 ? s.begin()
                 : std::find_if( s.begin(), s.end(), std::not1( p ) );
  while (ltok != s.end())
    {
    rtok = std::find_if( ltok, s.end(), p );
    *i++ = StringType( ltok, rtok );
    ltok = empties_ok
         ? ((rtok != s.end()) ? (rtok + 1) : rtok)
         : std::find_if( rtok, s.end(), std::not1( p ) );
    }
  if (empties_ok and (rtok != s.end()))
    *i++ = StringType();
  return i;
  }

You'll notice that this particular function requires you to provide a structured container type. (You can't just pass, for example, a string literal.) That's a minor detail -- you can either provide overloads for it or just call it with the proper container constructor.
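As a rough illustration of the "proper container constructor" route: the sketch below restates the non-empty-token split in condensed form so it compiles on its own. It swaps std::not1 for a C++11 lambda, which is an assumption on my part and not how the function above is written. splitLiteralDemo is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Condensed restatement of the non-empty-token split, using a lambda in
// place of std::not1 (assumes a C++11 compiler).
template <typename StringType, typename Predicate, typename OutputIterator>
OutputIterator split(const StringType& s, Predicate p, OutputIterator i)
{
	typedef typename StringType::const_iterator iter;
	auto not_p = [&p](typename StringType::value_type c) { return !p(c); };
	iter ltok = std::find_if(s.begin(), s.end(), not_p);
	while (ltok != s.end())
	{
		iter rtok = std::find_if(ltok, s.end(), p);
		*i++ = StringType(ltok, rtok);
		ltok = std::find_if(rtok, s.end(), not_p);
	}
	return i;
}

// Hypothetical demo: a bare literal would deduce StringType as a char
// array, which has no const_iterator, so the literal is wrapped in
// std::string at the call site.
std::vector<std::string> splitLiteralDemo()
{
	std::vector<std::string> out;
	auto isComma = [](char c) { return c == ','; };
	split(std::string("a,,b,c"), isComma, std::back_inserter(out));
	return out;
}

void checkLiteralDemo()
{
	const std::vector<std::string> out = splitLiteralDemo();
	assert(out.size() == 3);
	assert(out[0] == "a" && out[1] == "b" && out[2] == "c");
}
```

The temporary std::string lives until the call returns, which is long enough because the tokens are copied into the vector.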

Also, as tokenization often occurs on textual data, the string class find functions can be employed instead of the generic algorithms... (But I'll leave that to you all.)
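For anyone who wants a starting point, here is one possible sketch of that idea using std::string::find_first_not_of and find_first_of. It only covers the non-empty-token case, and splitNonEmpty is a hypothetical name, not something from the code above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical function: returns the non-empty tokens of s, treating every
// character in delims as a separator.
std::vector<std::string> splitNonEmpty(const std::string& s, const std::string& delims)
{
	std::vector<std::string> tokens;
	std::string::size_type left = s.find_first_not_of(delims);
	while (left != std::string::npos)
	{
		std::string::size_type right = s.find_first_of(delims, left);
		// substr takes (position, length); when right is npos the length is
		// clamped to the rest of the string.
		tokens.push_back(s.substr(left, right - left));
		// Searching from npos finds nothing, which ends the loop.
		left = s.find_first_not_of(delims, right);
	}
	return tokens;
}

void checkSplitNonEmpty()
{
	const std::vector<std::string> t = splitNonEmpty("  one, two,,three ", ", ");
	assert(t.size() == 3);
	assert(t[0] == "one" && t[1] == "two" && t[2] == "three");
}
```

Returning a std::vector also sidesteps the manual delete [] that the array-returning versions earlier in the thread require.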

Enjoy!
Last edited on Sep 29, 2011 at 2:24am
Topic archived. No new replies allowed.