PARCO 2019 Additional Material

This document contains extra material for the PARCO journal article “Analysis of OpenMP 4.5 Offloading in Implementations: Correctness and Overhead”. These materials and resources could not be included in the original manuscript due to space limitations and the scope of the paper.

It is important to understand that the information found here is intended for our own documentation and for possible reproducibility of these results. However, we do not provide support for maintaining the scripts that we have generated, nor do we promise any progress to this end. These results are not yet meant to be part of the test suite.

How to reproduce the results:

You will need the repository branch called “timing”:

git clone https://github.com/SOLLVE/sollve_vv.git
cd sollve_vv
git checkout timing

There are two running modes: CUPTI or wall time.

WALL TIME:

This mode only records the wall time for each experiment, run multiple times. By default each experiment runs 3 times, and the maximum and minimum values are discarded. For your results to be meaningful, you will probably want to increase this to a larger number.
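The discard-min/max policy described above can be sketched as follows. This is an illustrative stand-in (the sample values and pipeline are not from the suite): given one timing sample per line, sort them, drop the single smallest and largest, and average the rest.

```shell
# Illustrative sketch of the discard-min/max averaging policy.
# Sample values are made up; one timing sample per line.
printf '%s\n' 350 384 350 352 351 \
  | sort -n \
  | sed '1d;$d' \
  | awk '{ s += $1; n++ } END { printf "%.6f\n", s / n }'
# prints 351.000000
```

With NUM_REP=12, the same idea applies: 12 samples are collected, the outliers are dropped, and 10 samples contribute to the average.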

To increase the number of repetitions:

For wall time, the number of repetitions can be increased by modifying the variable NUM_REP.

For example, to get the wall time with 10 repetitions plus the 2 discarded outliers (hence NUM_REP=12), use:

> make clean; rm -rf logs
> make CC=gcc CXX=g++ LOG_ALL=1 LOG=1  SOURCES=tests/target/test_target_timing.c NUM_REP=12 VERBOSE=1 VERBOSE_TESTS=1 all
> sys/scripts/parse2tabs.py -o convertedLog.txt logs/test_target_timing.c.log
> head convertedLog.txt
TestName	AVG_TIME	STD_DEV	MEDIAN	MAX_TIME	MIN_TIME
[target]	350.000000	0.000000	350.000000	384	350
[target_defaultmap]	343.000000	0.000000	343.000000	358	343
[target_dependvar]	722.000000	0.000000	722.000000	723	722
[target_device]	341.000000	0.000000	341.000000	350	341
[target_firstprivate]	717.000000	0.000000	717.000000	725	717
[target_private]	341.000000	0.000000	341.000000	354	341
[target_if]	340.000000	0.000000	340.000000	350	340
[target_is_device_ptr]	372.000000	0.000000	372.000000	384	372
[target_map_to]	379.000000	0.000000	379.000000	404	379
...
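Since parse2tabs.py emits a plain tab-separated table, standard tools can post-process it. A hypothetical example: rank the tests by average time. Shown here on a tiny two-column sample file in the same format; point the pipeline at convertedLog.txt instead of sample.txt to use it on real output.

```shell
# Hypothetical post-processing: list the slowest tests by AVG_TIME (column 2).
# sample.txt stands in for convertedLog.txt with a reduced column set.
printf 'TestName\tAVG_TIME\n[target]\t350.0\n[target_dependvar]\t722.0\n[target_defaultmap]\t343.0\n' > sample.txt
tail -n +2 sample.txt | sort -t "$(printf '\t')" -k2,2 -rn | head -5
rm sample.txt
```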

CUPTI:

CUPTI provides a trace output. This has two implications: 1) each experiment can only be run once, so we use an outer for loop to account for multiple iterations; 2) the different traces overlap, so the aggregated values of the CUPTI traces cannot be used directly; the overlapping regions must be discarded first.
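The outer loop mentioned above can be sketched as follows. The iteration count and directory names are illustrative, not fixed by the suite; each pass reruns the CUPTI build and saves its logs separately so the traces from different iterations do not mix.

```shell
# Sketch of the outer repetition loop for CUPTI mode (illustrative names).
for i in 1 2 3; do
  make CC=gcc CXX=g++ LOG_ALL=1 LOG=1 SOURCES=timing* VERBOSE=1 VERBOSE_TESTS=1 CUDA_CUPTI=1 all
  mkdir -p "run_$i"
  cp logs/* "run_$i"/   # keep each iteration's traces separate
done
```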

For CUPTI mode, you will need to make sure that you export CUDA_HOME, enable CUPTI, and add the CUPTI dynamic libraries to LD_LIBRARY_PATH:

> export CUDA_HOME=/software/apps/cuda/9.2/
> export LD_LIBRARY_PATH=$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH

Then run all the timing experiments. In the current state of the plot script, all the timing results must be present; otherwise you will need to comment out parts of the output plots.

> make clean
> rm -rf logs
> make CC=gcc CXX=g++ LOG_ALL=1 LOG=1  SOURCES=timing* VERBOSE=1 VERBOSE_TESTS=1 CUDA_CUPTI=1 all
> for i in logs/*; do cat "$i" >> allLogs.log; done

In this case, the trace is generated and stored in the logs/ folder. Multiple traces are supported, and one of the plots produces a histogram of these executions. A separate script generates the plots themselves. The problem is that this script is not very user friendly. It is the one we used to generate the plots for the paper, and it can produce other plots as well.

Script sys/scripts/plotResultsTiming.py

This script creates the plots that you find in the paper. It also includes system information and the compiler version we used. allLogs.log, generated previously, holds all the aggregated log traces:

> sys/scripts/plotResultsTiming.py -o test_file allLogs.log

Currently this generates the plots, but it is not possible to obtain the raw data. However, there is an option to create a cache that avoids the time-consuming parsing of the log results. This cache is a JSON file, which could also be used to extend the results:

> sys/scripts/plotResultsTiming.py -o test_file allLogs.log -c cache_logs.log
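Since the cache is plain JSON, standard tools can inspect it. A small sketch, shown on a stand-in file (the real cache's structure is whatever plotResultsTiming.py writes); point json.tool at cache_logs.log after running the command above.

```shell
# Pretty-print a JSON file with the standard library's json.tool.
# cache_demo.json stands in for cache_logs.log here.
printf '{"example": 1}' > cache_demo.json
python3 -m json.tool cache_demo.json
rm cache_demo.json
```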

If you want to change the plots, it is better to create a cache file. If the cache file does not exist, the script parses the logs and creates it; if the cache file exists, its content is used to generate the plots.

Currently it is necessary to comment in and out the plots you want to output in the script. By default, the command above generates the following PNG files:

test_file_target_data.png
test_file_target_enter_data.png
test_file_target_exit_data.png
test_file_target.png
test_file_target_teams_distribute_combined_Vs_nested.png
test_file_target_teams_distribute_num_teams.png
test_file_target_teams_distribute_parallel_for_combined_vs_nested.png
test_file_target_teams_distribute_parallel_for_num_teams_num_threads.png
test_file_target_teams_distribute_parallel_for.png
test_file_target_teams_distribute.png
test_file_target_update.png

Finally, the --debug NUMBER flag can be used to see what is going on during the parsing of the files; NUMBER controls the verbosity.