Advice on creating a regex like tool

Hi guys,

So Regular expressions are normally written using Deterministic finite automata(DFAs) and Non-Deterministic finite automata, to my understanding I would imagine deterministic finite automata would be easier to implement in code. Ken Thompson wrote an algorithm for regex which eventually would be famously utilized in GREP.

This is the source I'm following - https://swtch.com/~rsc/regexp/regexp1.html

I sketched out a couple of diagrams (I'll attach a link to the sketches)
- https://ibb.co/Yyg193h
- https://ibb.co/9yvp0Rp

every automata can be translated to a regex expression

As you can see from my code I implemented a very simple automata that will check for a simple pattern such as if the user is looking for a occurrences of th ( regex -> th). But implementing harder regex expressions seem a little trickier such as th* or th+(a|i|o)

Also as you probably observed I create the automata first, and then traverse the automata to match an expression, if the accepted function( which traverses the automata) returns true then there is a match such as "th" in "there" or "the". I tested this with a paragraph and seems to work fine.

But again with more complicated regex expressions this will not be suffice, my thoughts straight away would be to turn the linked list into a graph, I will also need to test for special characters such as (*,+,(),|,/) here I'm guessing I could use a switch statement to examine the characters to test for a special character but do you think I should implement it in a function or a class method (such as in Automata or state)?

Here is my code so far, note I need to take care of memory management as of now I have many memory leaks in this program but will take care of that soon.

any advice or your own ideas on how to implement more complicated regex/automatas.

edit* come to think about it even this code is wrong, for example if I was searching for any words with occurrences with "th" it will only check if the word starts with "th", obviously this isn't right, it needs to search for occurrence of th.

Thanks

main
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112

#include <iostream>
#include "Automata.h"
#include "Paragraph.h"

using namespace std;

Automata* createAutomata(string str){

     int i = 0;
     State* one = new State(0,str.at(i),true,false,NULL);
     State* two;
     State* start = one;

     while(i < str.size()){

         i++;
         if(i == str.size()){

            two = new State(i,'%',false,true,NULL); // % means finished char
            one->next = two;
            one = two;
         }
        else{
            two = new State(i,str.at(i),false,false,NULL);
            one->next = two;
            one = two;
        }
     }
     Automata* at = new Automata(start,two->ste);
     return at;
}

// traverse the automata, if we get to the % character means we go to the end(accepting state)
bool accepted(string str,Automata* a){

     int i = 0;
     State* nxt = a->state;

     while(i < str.size()){

        if(str.at(i) == nxt->ch){// move to next state
           nxt = nxt->next;
           ++i;
           if(nxt->ch == '%')
            return true;
        }
        else
            return false;

     }
  return false;
}

// helper function / debugging function
void traverseAutomata(Automata* s){

   State* head = s->state;

   while(head != NULL){

     cout << head->ste << " ch == " << head->ch << endl;
     head = head->next;
   }
}

vector<string> searchWords(Paragraph& para,Automata* reg){

      vector<string> matches;

     for(int i = 0; i < para.words.size(); ++i){

        if(accepted(para.words.at(i),reg))
            matches.push_back(para.words.at(i));

     }
     return matches;
}

int main()
{
    Automata* reg = createAutomata("ABC");
    Paragraph para;
    string str;
    int i = 0;

    while(i < 3){

        cout << "enter a string " << endl;
        cin >> str;

        if(accepted(str,reg))
            cout << "accepted" << endl;
        else
            cout << "not accepted" << endl;
        ++i;
    }

    cout << "search :: enter regex you want to look for" << endl;
    cin >> str;
    Automata* rege = createAutomata(str);
    vector<string> matches = searchWords(para,rege);

    if(matches.size() == 0)
        cout << "no matches" << endl;
    else
    {
        for(int i = 0; i < matches.size(); ++i)
            cout << i << ") = " << matches.at(i) << endl;
    }
}



State.cpp
1
2
3
4
5
6
7
8
9
10
11

#include "State.h"

State::State(int ste,char ch,bool start,bool finish,State* next):ste(ste),ch(ch),start(start),finish(finish),next(next)
{}

State::State(){ ste = 0; ch = '%'; start = false; finish = false; next = nullptr;}

State::~State()
{}


Automata.cpp
1
2
3
4
5
6
7
8
9
10
11

#include "Automata.h"

Automata::Automata(State* state,int size):state(state),size(size)
{}

Automata::Automata(){size = 0;}

Automata::~Automata()
{}


Paragraph.cpp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include "Paragraph.h"

Paragraph::Paragraph()
{
    input.open("words.txt");
    string str;

    while(!input.eof()){

        input >> str;
        words.push_back(str);
    }
}

Paragraph::~Paragraph()
{
}
Last edited on
Regular expression algorithms are notoriously difficult to get right, and I cannot recommend anything but a basic toy for a beginner. (Sorry.)

Remember that the symbols used in a regex directly relate to the DFA graph. For example, a "+" means that you have a loop on a single node (an edge that leaves a node and returns directly to the same node) or node grouping (an edge leaves the last node in a group and returns directly to the first node in the group). Hence:

    A+      : One or more 'A's: "AAAAA"
    (ABC)+  : One or more 'ABC's "ABCABCABC"

Your algorithm must be able to construct a graph that represents a valid DFA, which means that you may have any number of edges leaving from a single node. In other words, you need to think harder about your design.

Things get trickier when you consider negative look-ahead and look-back, if you want to implement things like that.

In order to avoid memory leaks and pointer hell, use C++'s ability to construct classes as you need them. I recommend using a structure like:

1
2
3
4
5
6
struct State
{
  std::string match_ranges;  // "AZaz" for example, or just "A" for a single character
  std::vector<std::size_t> transitions;  // lookup index into States
  ...
};

Keep your states in a vector and reference them by index:

 
std::vector<State> States;

This can be easily wrapped by a RegEx class of some sort.


Don't loop on EOF. Loop on whether or not your read attempt succeeded:

8
9
    while(input >> str){
        words.push_back(str)

If performing regular expression matching on input text, though, you should get all input as a single string:

1
2
3
4
5
6
7
std::string s;
{
    std::ifstream f("words.txt");
    std::stringstream ss;
    ss << f.rdbuf();
    s = ss.str();
}

This gets you everything, spaces, tabs, newlines, etc, all in a single string you can play with.

Good luck!
Regular expression algorithms are notoriously difficult to get right

Thompson's construction is the easiest of the bunch, and produces an NFA to do matching in O(n) to boot. So you're on the right track.
Last edited on
closed account (z05DSL3A)
adam2016, it might be worth taking a look at Compiler Design in C[1] by Allen Holub. It's freely available so will only cost you some time.

____________________________________________
[1] https://holub.com/compiler/
I've got the printed version of the book. Definitely worth a read for anyone involved with lexical/syntactical analysis. The book contains the complete c-source code for a subset of the c language (things like floating point excluded) - which is why it's so large.
Thanks guys, I didn't think it would be a hugely complicated task in building a regex parser but boy was I wrong, I may put this project on the long finger and make it when I get some more experience with graphs and algorithms.

I will definitely give that book a read too :)
Thanks, will do :)
Topic archived. No new replies allowed.