This document contains supplementary material for the PARCO Journal article “Analysis of OpenMP 4.5 Offloading in Implementations: Correctness and Overhead”. We were not able to include these materials and resources in the original manuscript due to space limitations and the scope of the paper.
Please note that the information found here is intended for our own documentation and for possible reproducibility of these results. We do not provide support for maintaining the scripts we have generated, nor do we promise any progress to that end. These results are not yet meant to be part of the test suite.
You will need the repository branch called “timing”:
git clone https://github.com/SOLLVE/sollve_vv.git
cd sollve_vv
git checkout timing
There are two running modes: CUPTI or wall time.
Wall-time mode records only the wall time for each experiment, run multiple times. By default each test runs 3 times, and the max and min are discarded. If you want your results to be meaningful, you will probably want to increase this to a larger number.
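As a rough sketch (not the test suite’s actual implementation), the discard-the-outliers statistic described above can be illustrated in Python: given a list of repeated wall times, drop the single maximum and minimum readings and average the rest.

```python
def summarize(times):
    """Discard one max and one min reading, then average the rest.

    Simplified illustration of the outlier handling described above;
    the actual test-suite code may differ.
    """
    if len(times) < 3:
        raise ValueError("need at least 3 repetitions to discard outliers")
    trimmed = sorted(times)[1:-1]  # drop the single min and the single max
    return sum(trimmed) / len(trimmed)

print(summarize([350, 384, 350]))  # -> 350.0 (only one sample survives)
```

With the default of 3 repetitions only one sample survives the trimming, which is why a larger `NUM_REP` is recommended.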
For wall time, you can increase the number of repetitions by modifying the variable NUM_REP. For example, if you want the wall time with 10 repetitions plus 2 discarded outliers (hence NUM_REP=12), you can use:
> make clean; rm -rf logs
> make CC=gcc CXX=g++ LOG_ALL=1 LOG=1 SOURCES=tests/target/test_target_timing.c NUM_REP=12 VERBOSE=1 VERBOSE_TESTS=1 all
> sys/scripts/parse2tabs.py -o convertedLog.txt logs/test_target_timing.c.log
> head convertedLog.txt
TestName AVG_TIME STD_DEV MEDIAN MAX_TIME MIN_TIME
[target] 350.000000 0.000000 350.000000 384 350
[target_defaultmap] 343.000000 0.000000 343.000000 358 343
[target_dependvar] 722.000000 0.000000 722.000000 723 722
[target_device] 341.000000 0.000000 341.000000 350 341
[target_firstprivate] 717.000000 0.000000 717.000000 725 717
[target_private] 341.000000 0.000000 341.000000 354 341
[target_if] 340.000000 0.000000 340.000000 350 340
[target_is_device_ptr] 372.000000 0.000000 372.000000 384 372
[target_map_to] 379.000000 0.000000 379.000000 404 379
...
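The converted log is a whitespace-separated table with a header row, so it is straightforward to load programmatically. A minimal sketch (the column names come from the header above; the parsing logic here is illustrative, not part of the suite’s scripts):

```python
def load_timing_table(path):
    """Parse the whitespace-separated table produced by parse2tabs.py
    into a dict keyed by test name.

    Illustrative sketch based on the output format shown above.
    """
    results = {}
    with open(path) as fh:
        header = fh.readline().split()[1:]  # skip the TestName column
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            name, values = fields[0], [float(v) for v in fields[1:]]
            results[name] = dict(zip(header, values))
    return results
```

For the sample above, `load_timing_table("convertedLog.txt")["[target]"]["AVG_TIME"]` would give 350.0.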
CUPTI provides a trace output. This has two implications: 1) it can only be run once, so we use an outer for loop to account for multiple iterations; 2) there is some overlap between the different traces, so the aggregated values of each CUPTI trace cannot be used directly; instead, the overlapping intervals must be discarded.
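One way to discard overlap is classic interval merging: merge the (start, end) intervals of the trace records before summing their durations, so overlapped time is counted only once. This is a generic sketch, not necessarily what the suite’s plotting script does:

```python
def merged_duration(intervals):
    """Total time covered by (start, end) intervals, counting overlaps once.

    Generic interval-merging sketch of the overlap removal described above;
    the actual trace processing in the suite's scripts may differ.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # disjoint interval: close the previous run and start a new one
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # extend the current run
    if cur_end is not None:
        total += cur_end - cur_start
    return total

print(merged_duration([(0, 10), (5, 15), (20, 25)]))  # -> 20, not 30
```

Naively summing the three durations above would give 30, double-counting the 5 units where the first two records overlap.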
For CUPTI mode, you will need to make sure that you export CUDA_HOME, enable CUPTI, and add the CUPTI dynamic libraries to LD_LIBRARY_PATH:
> export CUDA_HOME=/software/apps/cuda/9.2/
> export LD_LIBRARY_PATH=$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH
Then we run all the timing experiments. In the current state of the plot script it is necessary to have all the timing results; otherwise, you will need to comment out parts of the output plots.
> make clean
> rm -rf logs
> make CC=gcc CXX=g++ LOG_ALL=1 LOG=1 SOURCES=timing* VERBOSE=1 VERBOSE_TESTS=1 CUDA_CUPTI=1 all
> for i in logs/*; do cat "$i" >> allLogs.log; done
In this case, the trace is generated and stored in the logs/ folder. We support having multiple traces, and there is a plot that produces a histogram of these executions.
A separate script generates the plots themselves, although it is not very user friendly. It is the one we used to create the plots in the paper, and it can produce additional plots as well. It also embeds some system information and the compiler version we used. allLogs.log, generated above, aggregates all the log traces:
> sys/scripts/plotResultsTiming.py -o test_file allLogs.log
Right now this will generate the plots, but it is not possible to obtain the raw data. However, there is an option to create a cache that avoids the lengthy parsing of the log results. This cache is a JSON file, which could also be used to extend the results.
> sys/scripts/plotResultsTiming.py -o test_file allLogs.log -c cache_logs.log
If you want to change the plots, it is better to create a cache file. If the cache file does not exist, the script will parse the logs and create it; if the cache file exists, the script will use its content to generate the plots.
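The load-or-create behavior described above follows a common caching pattern. A hedged sketch (this is not the script’s actual code; the function and file names are illustrative):

```python
import json
import os

def load_or_build_cache(cache_path, parse_logs):
    """Return cached parse results, building the cache on first use.

    Illustration of the cache behavior described above: parse_logs is any
    callable that performs the expensive log parsing and returns
    JSON-serializable data. Sketch only, not the actual
    plotResultsTiming.py implementation.
    """
    if os.path.exists(cache_path):
        # Cache hit: reuse the parsed results and skip the slow parsing step
        with open(cache_path) as fh:
            return json.load(fh)
    # Cache miss: parse once, then persist the result for next time
    data = parse_logs()
    with open(cache_path, "w") as fh:
        json.dump(data, fh)
    return data
```

Because the cache is plain JSON, it can also be edited or merged by hand to extend the results, as noted above.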
Right now it is necessary to comment in and out the plots you want to output. By default, the above command generates this list of PNG files:
test_file_target_data.png
test_file_target_enter_data.png
test_file_target_exit_data.png
test_file_target.png
test_file_target_teams_distribute_combined_Vs_nested.png
test_file_target_teams_distribute_num_teams.png
test_file_target_teams_distribute_parallel_for_combined_vs_nested.png
test_file_target_teams_distribute_parallel_for_num_teams_num_threads.png
test_file_target_teams_distribute_parallel_for.png
test_file_target_teams_distribute.png
test_file_target_update.png
Finally, the --debug NUMBER flag can be used to see what is going on during the parsing of the files. The NUMBER changes the verbosity.