2014年02月07日

マルチコアプログラミングで遊ぼう!Part 2




Introduction


The previous article was in Japanese, but colleagues at the office suggested that English would be fine too, so I will take them up on that kind offer and, with apologies, continue in English.

Last time, we started using our basic framework to execute independent tasks on multiple cores at the same time.
We ran preliminary tests to see how much performance we were getting while pushing and popping tasks around.

Now that we have a good measure of the cost of the system under load, we are going to look at some APIs and at how we can design classes to use inside the framework.

Sample Workload

Today we are going to explore and play with the most basic kind of task.

In this case, each task must be independent of any other task and have no dependencies, as tasks may execute on multiple cores at the same time.

In our case, even though the tasks share a resource as read-only, their internal states do not interfere with each other.

One point that could be addressed is the so-called "false sharing" issue, but we are not going to discuss it in detail today.
Just know that false sharing happens when multiple cores access the same cache line, which disrupts overall performance; a minimal sketch follows.
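As a quick illustration only (plain C++11, independent of our framework; every name below is invented for this sketch), here is the classic symptom and the usual cure, which is to give each hot variable its own cache line:

// Illustration of false sharing (NOT part of the framework; all names
// here are made up for the sketch). Two threads increment two
// *different* counters. In CountersShared both counters live on the
// same cache line, so the cores keep invalidating each other's copy.
// In CountersPadded each counter owns a full 64-byte line.
#include <thread>

struct CountersShared {
    unsigned a;              // a and b share one cache line
    unsigned b;
};

struct CountersPadded {
    alignas(64) unsigned a;  // own cache line
    alignas(64) unsigned b;  // own cache line
};

template <typename T>
void hammer(T& c) {
    std::thread t1([&c] { for (int i = 0; i < 100000000; ++i) { c.a++; } });
    std::thread t2([&c] { for (int i = 0; i < 100000000; ++i) { c.b++; } });
    t1.join();
    t2.join();
}

On a multicore machine, hammer() on a CountersPadded instance typically runs several times faster than on a CountersShared instance, even though the logic is identical.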

In our sample we are going to process an 8-bit greyscale image with two tasks:
    - One task computing the sum of all pixel values, as well as the min and max values of the array.
    Implemented in the class WorkSumMinMax.

    - One task computing the histogram of the image.
    Implemented in the class WorkHistogram.



We will also introduce another feature of the framework: task dependency.
We will gather the results after the first two tasks complete by defining a third, dependent task.

So, in today's article we are going to look at two points:
- The benefit of running independent code in parallel.
- The ability of the framework to handle dependencies and react to task completion to spawn more work.

Here is a graphic explaining what is happening:

Execution Scheme

Let's code

Let's start by creating our parallel code:

#include "LX_MultiTasking.h"
#include <stdio.h>

// ===============================================================================
// Task computing the min/max and sum of a greyscale image.
//
class WorkSumMinMax : public Task {
public:
  WorkSumMinMax(u8* source, u32 width, u32 height)
  : m_width(width)
  , m_height(height)
  , m_greyScaleImage(source) {
    TSK_SetRunFunction((execFunc)&WorkSumMinMax::DoSumMinMax);
  }
  
  void DoSumMinMax(u8 worker, TaskParam* param);
  
  u8*    m_greyScaleImage;   // source image (read-only, shared with the other task)
  u32    m_width;
  u32    m_height;
  u32    m_sum;              // results written by DoSumMinMax
  u32    m_min;
  u32    m_max;
};

void WorkSumMinMax::DoSumMinMax(u8 worker, TaskParam* param) {
    // Initialization
    u32 size = m_width * m_height;
    u8* parse;
    u8* parseEnd = &m_greyScaleImage[size];
    
    u32 max = 0;
    u32 min = 255;
    u32 sum = 0;
    
    // Compute the max, min, and sum
    parse     = m_greyScaleImage;
    while (parse < parseEnd) {
        u8 val = *parse++;
        sum += val;
        if (val > max)    { max = val; }
        if (val < min)    { min = val; }
    }
    
    m_max = max; m_min = min; m_sum = sum;
}

// ===============================================================================
// Task computing the histogram of a greyscale image.
//

class WorkHistogram : public Task {
public:
  WorkHistogram(u8* source, u32 width, u32 height)
  : m_width(width)
  , m_height(height)
  , m_greyScaleImage(source) {
    TSK_SetRunFunction((execFunc)&WorkHistogram::DoComputeHistogram);
  }
  
  void DoComputeHistogram(u8 worker, TaskParam* param);
  u8*    m_greyScaleImage;   // source image (read-only, shared with the other task)
  u32    m_width;
  u32    m_height;
  
  u32    m_histogram[256];   // result: pixel count for each grey value
};

void WorkHistogram::DoComputeHistogram(u8 worker, TaskParam* param) {
    u32 size = m_width * m_height;
    u8* parse;
    u8* parseEnd = &m_greyScaleImage[size];
    
    u32* histogram = m_histogram;
    
    // Reset all 256 bins
    for (int n=0; n<256; n++) {
        histogram[n] = 0;
    }
    
    // Count the occurrences of each grey value
    parse     = m_greyScaleImage;
    while (parse < parseEnd) {
        u8 val = *parse++;
        histogram[val]++;
    }
}

// ===============================================================================
// This task uses the results of the first two tasks to create two histograms separated by the average pixel value.
//
class WorkSeperate : public Task {
public:
  WorkSeperate(WorkHistogram* taskA, WorkSumMinMax* taskB)
  : m_histo (taskA)
  , m_sum   (taskB)
  , m_complete(false)
  {
      TSK_SetRunFunction((execFunc)&WorkSeperate::SeperateHistogram);
  }
  
  void SeperateHistogram(u8 worker, TaskParam* param);
  
  WorkHistogram* m_histo;
  WorkSumMinMax* m_sum;
  u32 m_histogramLow [256];
  u32 m_histogramHigh[256];
  bool m_complete;
};

void WorkSeperate::SeperateHistogram(u8 worker, TaskParam* param) {
    // Compute the average pixel value of the image
    u32 average = (m_sum->m_sum) / (m_sum->m_width * m_sum->m_height);
    
    // Initialization
    for (int n=0; n<256; n++) {
        m_histogramLow[n] = 0; m_histogramHigh[n] = 0;
    }
    
    // Split the histogram into two halves around the average value
    u32* histoData = m_histo->m_histogram;
    for (int n=0; n < 256; n++) {
        if (n < average) { m_histogramLow [n] = histoData[n]; }
        else             { m_histogramHigh[n] = histoData[n]; }
    }
    
    m_complete = true;   // flag completion so the caller can tell all work is done
}

... Main Function Here... (See full source link just after)
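To give an idea of the wiring, here is a rough sketch of what the main function does. Only the Task base class, TSK_SetRunFunction, and CreateWorker are confirmed in this article; PushTask and AddDependency below are hypothetical placeholder names standing in for the real LX_MultiTasking calls, so refer to the linked source for the actual API:

// Sketch only: task wiring. PushTask / AddDependency are HYPOTHETICAL
// placeholder names; the real calls live in LX_MultiTasking.h.
int main() {
    const u32 width  = 1024;          // image size assumed for the sketch
    const u32 height = 1024;
    u8* image = new u8[width * height];
    // ... fill image with greyscale data ...
    
    CreateWorker(2);                  // two workers: A and B can run in parallel
    
    WorkSumMinMax taskA(image, width, height);
    WorkHistogram taskB(image, width, height);
    WorkSeperate  taskC(&taskB, &taskA);
    
    AddDependency(&taskC, &taskA);    // hypothetical: C runs after A...
    AddDependency(&taskC, &taskB);    // ...and after B
    PushTask(&taskA);                 // hypothetical: hand tasks to the workers
    PushTask(&taskB);
    PushTask(&taskC);
    
    while (!taskC.m_complete) { }     // main thread waits for the final task
    
    delete[] image;
    return 0;
}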

Run the program


Now we run the code (the framework is here; PLEASE USE THE LATEST COMMIT as of today, 2014/02/07, as it has been updated since the last article. The complete source for this article is here)
and measure the following performance:

Here is the log for 1 core (CreateWorker with 1 as parameter), so all code runs serially.
Here is the log for 2 cores (CreateWorker with 2 as parameter), so the tasks are executed in parallel.

We obtain the following graph: microseconds of execution for each iteration, with one and two cores.
Execution time with one and two cores

With one core, the base line is around 250,000 uSec for all our tasks to complete.
With two cores, the base line is around 126,000 uSec for all our tasks to complete.

We can see that the base lines are stable enough that the spikes in the chart are most likely due to issues like OS thread interruptions and other things running in the background during the benchmark.

We notice that by running on two cores, the execution time is divided by two (250,000 / 126,000 ≈ 1.98x).
This means two things:

  • Each task needs roughly the same CPU time.
  • Neither task is memory bound, and they do not starve each other for memory bandwidth.

In the case of a single worker we have :
Worker 0 : [Task A][Task B][Task C]

In the case of a double worker we have :
Worker 0 : [Task A or B]
Worker 1 : [Task B or A][Task C]
or
Worker 0 : [Task A or B][Task C]
Worker 1 : [Task B or A]

Note that Task C is negligible: it is a thousand times smaller than A or B.

If A and B differed enough in execution time, we would end up with something like:
[Task A]
[Task B -----][Task C]

Now we take the execution log for 2 cores, extract the execution time of each task, and obtain the following graphs:

- Cumulative time: Sum + Histogram per iteration, in microseconds.


- Separate: Sum and Histogram per iteration, in microseconds.


Conclusion


As expected, we verified that each task has roughly the same execution time, with the histogram task taking on average a few percent longer than the sum task. So, while quite obvious, we showed that our framework allows us to easily spawn write-independent tasks over multiple cores, and that this speeds up the application quite efficiently. One could note that no mutex or lock was necessary to synchronize the two tasks, nor any polling for the last task to gather the information. The system is as efficient as it can be.

Next Time


Next time we will change this sample, convert it into a streaming task, and see if we can improve efficiency while still using multiple cores at the same time.

