Best practice mutex in loop

Hello, I have written a program to analyze the href in html pages.
But I don't know where to put the mutex to optimize my code.
Wouldn't it be better to put it around my loop that parse my vector?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
boolean alreadyCrawled(string url)
{
    m_crawled.lock();
    for (auto i = crawled_vector.begin(); i != crawled_vector.end(); ++i)
        if(url == *i)
        {
            m_crawled.unlock();
            return true;
        }
    m_crawled.unlock();
    return false;
}



void parse(string url)
{
    vector<string> href = get_href(url);
    for (auto i = href.begin(); i != href.end(); ++i)
    {
        string new_url = format_url(*i);
        if(filter(new_url) == false && alreadyCrawled(new_url) == false){
            insertInJob(new_url);
        }
    }
}
This code has no reference to using multiple threads - so why a mutex?
I use multiple threads
I guess it just depends on what other threads could possibly do to the vector and what you're trying to do to it. If you're just reading from it a lock is probably unnecessary. If you need to write to it just lock out the element at the time of the write and then anything that is simply reading it is guaranteed to get the most recent changes. It will either wait to lock if other thread is reading that element then lock to write or vise versa. You don't want to lock out the whole vector unless it's required to synchronize with other threads. If these threads all revolve around this same vector or share other resources it may almost be pointless to have multiple threads. Just alot of factors with very little information provided.
But I don't know where to put the mutex to optimize my code.
You need to lock the mutex as long as the object to protect is used. So yes, you need to put it around the loop, since otherwise another thread could make the iterator invalid.

You may simplify this with lock_guard:
1
2
3
4
5
6
7
8
9
10
boolean alreadyCrawled(string url)
{
    std::lock_guard<std::mutex> lck (m_crawled);
    for (auto i = crawled_vector.begin(); i != crawled_vector.end(); ++i)
        if(url == *i)
        {
            return true;
        }
    return false;
}
This would also make your code exception safe.

By the way: Why do you use boolean? C++ has a type bool.
> I use multiple threads
Are you starting from some existing single threaded code, when on a whim you decided "I know, I'll add threads".

Was the code designed with concurrency in mind, or is it just some hacked together mess of global variables?

Seriously, you need a design up front, to tell you who owns what and in what circumstances information is shared between threads. Just hacking in some locks here and there is a recipe for failure.

What's your unit of concurrency? Concurrency will work better when you're working on something 'big' (like a domain), not something 'small' (like a URL). If your detail level is too small, you're going to end up locking and waiting more than you bargained for.

https://stackoverflow.com/questions/1944344/why-is-a-multithreaded-environment-considered-harmful

Badly implemented thread code will be full of these
https://en.wikipedia.org/wiki/Heisenbug
I'm no super c++ wizard but I've been around the block a time or 2 and the only thing that's going to invalidate an iterator is a new allocation via pushes or inserts or like resizing it. I've synced threads around vectors b4 and if threads aren't allocating or deallocating I've found in my experience simply locking a particular element to do a write operation is perfectly fine, especially if dueling threads have a similar work load they may bump here and there but after a slight pause of one or the other they will just flow down the loop in single file order. Obviously the direction of the loops being the same will alow that to happen. Locking out the whole vector if it's being accessed frequently is a prime example of a bad design bc one thread or the other will spend most time just waiting....vectors and iterators aren't some magical thing that randomly goes in an out. Iterators are just a class over top of vectors that grab the pointer to the element and return it in various ways and keep track of bounds etc. It's not a whole heck of alot more complicated than manually looping through pointers c style with a wrapper over top and a pretty bow. Also they will invalidate if you go out of scope bc they're just temp objects on the stack. I'm so used to just looping over pointers I don't usually bother with vectors etc unless it's very specific circumstances. I'm usually going for bare metal performance so every extra thing counts. My 2c but I'm fairly certain I've done similar things like this b4. Even if one thread is reading an element and it gets written to it shouldn't break its just that the reading thread won't see the updated data until the next time around the wagon.
You can start optimising by not passing objects by value.

You appear to be passing strings and vectors by value, I'm sure that's unnecessary.

Also, you're using an O(n2) complexity algorithm. Perhaps you can improve on that.
Last edited on
Registered users can post here. Sign in or register to post.