Parallelism (std::thread)

krevedkonet · August 7, 2019, 12:57pm

Hi there!
I have been trying to learn parallelism with std::thread. I have read many topics and examples on the subject, but I just can't get it working in my program.

First, a description.
Since creating and destroying threads is expensive for the CPU, I decided to create a vector holding a fixed number of threads in Level1.cpp:
Level1.h:

```cpp
#include <mutex>
#include <thread>
#include <vector>
#include "cocos2d.h"

class Level1 : public cocos2d::Layer
{
public:
    static cocos2d::Scene* createScene();
    virtual bool init();
    CREATE_FUNC(Level1);
    void update(float dt) override;

private:
    // ... (declaration of variables)

    std::vector<std::thread> threadsPool;
    std::mutex mtx;
};
```

In Level1.cpp:

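```cpp
bool Level1::init()
{
    if (!Layer::init())
    {
        return false;
    }

    // ... initializations

    // hardware_concurrency() returns an unsigned value and may return 0 if it cannot be determined.
    int numThreads = std::thread::hardware_concurrency();
    // resize() only default-constructs std::thread objects; no thread is started here.
    threadsPool.resize(numThreads);

    this->scheduleUpdate();
    return true;
}

void Level1::update(float dt)
{
    // ... other calculations
    vai->vehicleAIMovement(dt, vehiclePlayerPosition, vehicleAIList, WayPointsList, it, threadsPool, mtx);
}
```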

(vai is an object of the VehicleAI class.)
The VehicleAI class has three functions:
- vehicleAIMovement(): called from Level1::update(float dt)
- trackingTrajectory(): called from vehicleAIMovement()
- trackingOtherVehicles(): called from trackingTrajectory()

So, in VehicleAI.cpp:

```cpp
float VehicleAI::trackingTrajectory(cocos2d::Vec2 vehiclePlayerPosition,
                                    std::vector<std::shared_ptr<VehicleAI>> vehicleAIList,
                                    int iterator, float V, float dt,
                                    std::vector<std::thread>& threadsPool, std::mutex& mtx)
{
    // ... many, many calculations

    int iterator1 = 0;
    // check every object in the list
    while (vehicleAIList.size() > iterator1)
    {
        // except itself
        if (iterator1 != iterator)
        {
            auto otherAIVehicle = vehicleAIList.at(iterator1);
            for (auto&& thread : threadsPool)
            {
                if (thread.joinable())
                {
                    continue;
                }
                else
                {
                    thread = std::thread([=, &boolStopVehicle, &allVehiclesOnCrossroadStopped, &distNearestVehicle] {
                        trackingOtherVehicles(otherAIVehicle, FM1, RM1, FR1, FL1, RR1, RL1, iterator1, F1, V1, W1, L1,
                                              minStoppingDistance, decelerationDistance, STP,
                                              std::ref(boolStopVehicle), std::ref(allVehiclesOnCrossroadStopped),
                                              std::ref(distNearestVehicle));
                    });
                }
            }
        }
        iterator1++;
    }
}
```

The trackingOtherVehicles() function is supposed to overwrite the boolStopVehicle variable and a few others.
In this form, the code does not work correctly: trackingOtherVehicles() stops being called once every thread in the pool has received its first task.
(Note that the number of threads is much smaller than the size of the vehicleAIList vector.)
While trying to solve this thread congestion problem, I arrived at the following:

```cpp
int iterator1 = 0;
// check every object in the list
while (vehicleAIList.size() > iterator1)
{
    // except itself
    if (iterator1 != iterator)
    {
        auto otherAIVehicle = vehicleAIList.at(iterator1);
        bool allThreadsRunning = true;
        while (allThreadsRunning)
        {
            for (auto&& thread : threadsPool)
            {
                if (thread.joinable())
                {
                    if (allThreadsRunning)
                    {
                        thread.join();
                    }
                    else
                    {
                        continue;
                    }
                }
                else
                {
                    std::lock_guard<std::mutex> lock(mtx);
                    thread = std::thread([=, &boolStopVehicle, &allVehiclesOnCrossroadStopped, &distNearestVehicle] {
                        trackingOtherVehicles(otherAIVehicle, FM1, RM1, FR1, FL1, RR1, RL1, iterator1, F1, V1, W1, L1,
                                              minStoppingDistance, decelerationDistance, STP,
                                              std::ref(boolStopVehicle), std::ref(allVehiclesOnCrossroadStopped),
                                              std::ref(distNearestVehicle));
                    });
                    allThreadsRunning = false;
                }
            }
        }
    }
    iterator1++;
}
```

But it still doesn't work correctly and gives no benefit from parallelization; the program slows down even more.
Obviously I'm doing something wrong, but I don't understand how to fix it.
I'm hoping a guru can help :)

R101 · August 7, 2019, 1:37pm

Can you please add 3 back-quotes (the character above the tilde ~, which is this `) before and after each section of code, since it’s really hard to read your post.

krevedkonet · August 7, 2019, 1:59pm

Sorry for the inconvenience; I've fixed everything.

slackmoehrle · August 7, 2019, 5:24pm

Cocos2d-x isn't thread-safe. There are things you can do, but any update to a Cocos2d-x object must be done on the GUI thread. See the Director class for details.

tdebock · August 7, 2019, 7:34pm

Have you checked the difference between std::thread's .join() and .detach()?

I would think that calling join() inside a loop adds waiting time, which I think is exactly what you are trying to avoid.

Also, as @slackmoehrle mentioned, if you do any rendering etc., you need to dispatch it back to the cocos thread.

slackmoehrle · August 7, 2019, 8:19pm

I store all my threads in a vector so I can get back to them if they are detached. It might be best to build a pool, but for my needs I know how many threads I have created from the size of the vector, and each position in the vector holds a class I created using ECM with a thread in it. I also store other info that identifies the thread, and I have a map that tells me what each thread is doing at any given time.

tdebock · August 8, 2019, 6:01am

```cpp
while (vehicleAIList.size() > iterator1)
{
    // except itself
    if (iterator1 != iterator)
    {
        auto otherAIVehicle = vehicleAIList.at(iterator1);
        bool allThreadsRunning = true;
        while (allThreadsRunning)
        {
            for (auto&& thread : threadsPool)
            {
                if (thread.joinable())
                {
                    if (allThreadsRunning)
                    {
                        thread.join();  // <==== this join()
```

@slackmoehrle @krevedkonet, I was referring to this .join() within the loop. As I understand it, the thread calling join(), i.e. the one running the while loop, has to wait for that thread to finish executing before moving on to the next iteration, which defeats the purpose I assumed you wanted. A .detach() will let the loop keep running.

I made a simple example. Compile it with C++11, run the executable, and watch how the joins inside the loop block the thread they are called from and thus hold up the loop, whereas detach lets the loop keep iterating:

```cpp
#include <iostream>       // std::cout
#include <thread>         // std::thread, std::this_thread::sleep_for
#include <chrono>         // std::chrono::seconds

void pause_thread(int n) {
  std::this_thread::sleep_for(std::chrono::seconds(n));
  std::cout << "pause of " << n << " seconds ended\n";
}

int main() {

  for (unsigned i = 0; i < std::thread::hardware_concurrency(); i++) {
    std::thread(pause_thread, 1).detach();
    std::cout << "called a detach\n";
  }

  pause_thread(3);

  for (unsigned i = 0; i < std::thread::hardware_concurrency(); i++) {
    std::thread(pause_thread, 1).join();
    std::cout << "called a join\n";
  }

  pause_thread(3);

  return 0;
}
```

tdebock · August 8, 2019, 6:14am

Alternatively, you can simply split the work by the number of concurrent threads.

For example:

- there is a loop of 100 calculations;
- there are 4 concurrent threads.

You can send 25 calculations to each thread, which is simpler than what you have set up.

Then, with proper detaching of the concurrent threads, 4 calculations run at a time, 25 times over.

krevedkonet · August 8, 2019, 8:34am

Maybe I don't quite understand how threads work. My impression was that with .detach(), the thread ceases to exist as an object once it finishes its work, so the thread object in the vector pool has to be re-created, which, as I said, is resource-intensive.
I thought a thread could be controlled independently: you give it work, and when it finishes its task it goes into standby mode and can be given another task.

I want to parallelize only the mathematical calculations, since there are a lot of them in this function, and this could matter a great deal when scaling up the number of processed objects.
I did run a load test and found that my CPU, with a maximum frequency of 2.4 GHz, has headroom for 100,000 light operations. But that is my computer's processor; a phone may not have that much computing power. Moreover, my program isn't finished yet, and more calculations and features will be added to it.

The thing is, the number of objects in the list keeps changing: objects are generated and added to it under certain conditions.
I have seen that kind of example too and could reproduce it, but I originally wanted to avoid the extra calculation of how many operations each thread should get, because I planned to use all the cores (excluding logical processors), and their number could be 1, 2, 3, 4, and so on.
That kind of example is fairly easy to use if you limit the number of threads to no more than 2 :)

Based on what I wrote, do I understand how this works correctly? Please correct me if I'm wrong.

tdebock · August 9, 2019, 3:26am

There are several clean and efficient C++ thread pool libraries on GitHub:

I would suggest looking around there for examples, or for libraries you can use.

Edit: Your understanding seems correct. I was pointing out the flaw of joining threads inside that loop, which makes the calling (current) thread wait. I suggest looking into thread pools; the libraries you can find on GitHub will save you the headache of setting up a good system yourself.

Never reinvent the wheel…