GPU benchmarks - discussion
Gathered GPU benchmark data:
https://phodograf.com/capture-one-benchmarks/
Additions are welcome, thanks!
WPNL wrote:
Gathered GPU benchmark data:
https://phodograf.com/capture-one-benchmarks/
Additions are welcome, thanks!
That is a good idea to collect them all in one table! Is it possible to allow independent data updates so everyone can post their own results?
We can collect the data in your other topic; I'll keep gathering until there's 'enough'.
I thought about an open document, but this way there is no (or less) chance of the data disappearing because someone made a mistake or tried to be funny.
Yes, it is possible that someone could erase all the data. It would be nice if there were some kind of form with a 'submit' button, so people can only add new values but won't be able to delete them.
I've added a form to the page, check it out 😉
It's not real-time but it will have to do for now... Let's see how many rows we can collect!
Great! Looks very nice 😄
Added a new benchmark value using the form
garrison wrote:
Great! Looks very nice 😄
Added a new benchmark value using the form
Thanks! Not bad for a start I guess 😊
I received the email and entered the data right away (couldn't keep you waiting, haha)
I am wondering if the monitor resolution that is set makes any difference in these benchmarks... does anybody know?
NNN636372919529824193 wrote:
I am wondering if the monitor resolution that is set makes any difference in these benchmarks... does anybody know?
It does, but very marginally. It is also affected by the number of monitors connected to the specific card - again, very marginally.
I was just about to add my data when I discovered I have two OpenCL benchmark numbers. Here is the relevant section of the log (the two 'OpenCL benchMark' lines are the ones in question):
2018-01-17 05:57:23.923> OpenCL : Loading kernels
2018-01-17 05:57:24.152> OpenCL : Loading kernels finished
2018-01-17 05:57:24.152> OpenCL : Benchmarking
2018-01-17 05:57:24.177> Started worker: TileExecuter 0 [unknown] (master: 2660, worker: 2d90)
2018-01-17 05:57:24.390> Shutting down: TileExecuter 0 [unknown] (master: 2660, worker: 2d90)
2018-01-17 05:57:24.390> Ending worker: TileExecuter 0 [unknown] (master: 2660, worker: 2d90)
2018-01-17 05:57:24.392> OpenCL : Initialization completed
2018-01-17 05:57:24.392> OpenCL benchMark : 0.833440
2018-01-17 05:57:24.850> First chance exception (thread 14248): 0xE06D7363 - C++ exception
2018-01-17 05:57:25.827> Started worker: TileExecuter 0 [unknown] (master: 2660, worker: 2654)
2018-01-17 05:57:25.975> Shutting down: TileExecuter 0 [unknown] (master: 2660, worker: 2654)
2018-01-17 05:57:25.975> Ending worker: TileExecuter 0 [unknown] (master: 2660, worker: 2654)
2018-01-17 05:57:25.977> OpenCL benchMark : 0.258960
2018-01-17 05:57:43.416> First chance exception (thread 10536): 0xE06D7363 - C++ exception
2018-01-17 05:57:43.416> (9 identical messages logged; delayed 20.666s .. 20.573s.)
2018-01-17 05:57:43.416> First chance exception (thread 14584): 0xE06D7363 - C++ exception
2018-01-17 05:57:43.416> (29 identical messages logged; delayed 20.391s .. 20.027s.)
2018-01-17 05:57:43.416> First chance exception (thread 3664): 0xE06D7363 - C++ exception
2018-01-17 05:57:43.416> (9 identical messages logged; delayed 20.100s .. 19.772s.)
2018-01-17 05:57:43.416> OpenCL CL_DEVICE_GLOBAL_MEM_CACHE_TYPE : 2
2018-01-17 05:57:43.416> (Message delayed 19.704s to prevent duplicates.)
2018-01-17 05:57:43.416> OpenCL CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 65536
2018-01-17 05:57:43.416> (Message delayed 19.704s to prevent duplicates.)
2018-01-17 05:57:43.416> OpenCL CL_DEVICE_IMAGE_SUPPORT : 0
2018-01-17 05:57:43.416> (Message delayed 19.704s to prevent duplicates.)
2018-01-17 05:57:43.416> OpenCL CL_DEVICE_ADDRESS_BITS : 64
2018-01-17 05:57:43.416> (Message delayed 19.703s to prevent duplicates.)
2018-01-17 05:57:43.416> OpenCL : Loading kernels
2018-01-17 05:57:43.416> (Message delayed 18.941s to prevent duplicates.)
2018-01-17 05:57:43.416> First chance exception (thread 14248): 0xE06D7363 - C++ exception
2018-01-17 05:57:43.416> (4 identical messages logged; delayed 18.566s .. 18.566s.)
2018-01-17 05:57:43.416> OpenCL : Loading kernels finished
2018-01-17 05:57:43.416> (Message delayed 17.589s to prevent duplicates.)
2018-01-17 05:57:43.416> OpenCL : Benchmarking
2018-01-17 05:57:43.416> (Message delayed 17.589s to prevent duplicates.)
2018-01-17 05:57:43.416> OpenCL : Initialization completed
2018-01-17 05:57:43.416> (Message delayed 17.438s to prevent duplicates.)
2018-01-17 05:57:43.416> First chance exception (thread 14248): 0xE06D7363 - C++ exception
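For anyone who wants to pull these values out of a log automatically, here is a minimal sketch; it assumes only the `OpenCL benchMark : <value>` line format visible in the excerpt above, and the log path is a placeholder:

```python
import re

# Matches lines like: "2018-01-17 05:57:25.977> OpenCL benchMark : 0.258960"
BENCH_RE = re.compile(r"OpenCL benchMark\s*:\s*([0-9.]+)")

def extract_benchmarks(log_path):
    """Return every OpenCL benchmark value found in the log, in order."""
    values = []
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = BENCH_RE.search(line)
            if match:
                values.append(float(match.group(1)))
    return values

# Path is a placeholder; two values, as in the log above, suggest that
# two OpenCL devices were benchmarked.
print(extract_benchmarks("ImgCore.log"))  # e.g. [0.833440, 0.258960]
```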
IanL wrote:
I was just about to add my data when I discovered I have two OpenCL benchmark numbers. Here is the relevant section of the log (the two 'OpenCL benchMark' lines are the ones in question):
I'd guess that's a laptop with an embedded and a discrete GPU? You can verify this in ImgCore.log
It is not a laptop, but the motherboard does have a GPU - which I thought was disabled. I am certainly not using it.
Confirmed in imgcore.log:
Device 0 : Intel(R) HD Graphics 4600
Driver Version : 20.19.15.4835
I will have to look into really turning that off - there's no need for it. How can I tell which benchmark belongs to which GPU? I looked at the log, and while imgcore.log identifies them as device 0 and device 1, the benchmark entries do not make it clear which GPU is which.
Looking at the benchmarks recorded by other people, I think I can guess which is which 😊
IanL wrote:
Looking at the benchmarks recorded by other people, I think I can guess which is which 😊
The smaller value is for the better GPU 😄
A question for the C1 team 😊
Will C1 use both GPUs for hardware acceleration (display and processing) if they are installed together? For example, integrated and discrete GPUs in a laptop, or two GPUs in one PC in SLI/CrossFire/independent mode?
Thanks.
CO can use up to 4 GPUs.
It doesn't matter whether they are connected to a display or not, as long as they support OpenCL, have more than 2 GB of RAM, and aren't too slow.
Processing speeds will benefit as long as the benchmarks of the installed cards aren't more than 4x slower than the fastest card.
There's no point in SLI/Crossfire for CO, but it will work fine alongside it if needed for gaming or other purposes.
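Read literally, those rules can be expressed as a small filter. The sketch below is only an illustration of the constraints as stated; the device list and its fields are hypothetical, not CO's internal representation:

```python
# Hypothetical device records: (name, supports_opencl, ram_gb, benchmark_ms).
# The benchmark is the CO value discussed in this thread: lower is faster.
devices = [
    ("GTX 1080 Ti", True, 11.0, 0.10),
    ("GTX 1070",    True,  8.0, 0.15),
    ("HD 4600",     True,  1.5, 0.83),  # under 2 GB of RAM: excluded
]

def usable_devices(devices, max_count=4, max_slowdown=4.0):
    """Filter per the rules stated above: OpenCL support, more than
    2 GB of RAM, and not more than 4x slower than the fastest card."""
    eligible = [d for d in devices if d[1] and d[2] > 2.0]
    if not eligible:
        return []
    fastest = min(d[3] for d in eligible)  # smallest time = fastest card
    worthwhile = [d for d in eligible if d[3] <= fastest * max_slowdown]
    worthwhile.sort(key=lambda d: d[3])
    return worthwhile[:max_count]

for name, _, _, bench in usable_devices(devices):
    print(f"{name}: {bench:.2f} ms")
```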
Christian Gruner wrote:
CO can use up to 4 GPUs.
Processing speeds will benefit as long as the benchmarks of the installed cards aren't more than 4x slower than the fastest card.
How do you know if a GPU is 4x slower or faster?
How does the calculation work - is 0.01 twice as fast as 0.02? And what does 1.0 stand for (it seems like the reference)?
WPNL wrote:
Christian Gruner wrote:
CO can use up to 4 GPUs.
Processing speeds will benefit as long as the benchmarks of the installed cards aren't more than 4x slower than the fastest card.
How do you know if a GPU is 4x slower or faster?
How does the calculation work - is 0.01 twice as fast as 0.02? And what does 1.0 stand for (it seems like the reference)?
The benchmark is linear, so 0.1 is twice as fast as 0.2.
The benchmark number is the time in milliseconds for a few selected OpenCL operations.
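In other words, relative speed is simply the ratio of the two times. A quick worked sketch using the two values from IanL's log earlier in the thread (which card is which is a guess, as discussed above):

```python
# CO benchmark values are times in milliseconds: lower is faster.
bench_slow = 0.833440  # presumably the integrated Intel GPU
bench_fast = 0.258960  # presumably the discrete GPU

print(f"The faster card is {bench_slow / bench_fast:.1f}x faster")  # ~3.2x

# Christian's rule of thumb: an extra, slower card still helps processing
# as long as its benchmark is within 4x of the fastest card's.
print("Extra card still worthwhile:", bench_slow <= 4 * bench_fast)
```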
WPNL wrote:
Gathered GPU benchmark data:
https://phodograf.com/capture-one-benchmarks/
Additions are welcome, thanks!
just posted to that 😊
Chad Dahlquist wrote:
WPNL wrote:
Gathered GPU benchmark data:
https://phodograf.com/capture-one-benchmarks/
Additions are welcome, thanks!
just posted to that 😊
They're added 😊 Thanks for your submission!
Christian Gruner
According to one of the export benchmarks made by gnwooding in this thread, with a GTX 1080 Ti there is no performance difference between single-GPU mode and dual-GPU SLI mode - about 55 seconds in both cases. Why does this happen? Is it incorrect testing or an internal C1 limitation? There is also a benchmark by garrison with a GTX 1070 and a slower CPU, and it also finished in 55 seconds. That looks strange, because that CPU+GPU combination is genuinely slower than the previous one, yet the results are the same. Can you explain this? It is really very interesting 😊
Can you also clarify the situation with CPU & GPU performance: can one of them bottleneck the other and, if so, how can a balanced configuration be achieved?
Thank you!
Don't those benchmarks also depend on other parts of the build, such as RAM/HDD/SSD?
garrison wrote:
Christian Gruner
According to one of the export benchmarks made by gnwooding in this thread, with a GTX 1080 Ti there is no performance difference between single-GPU mode and dual-GPU SLI mode - about 55 seconds in both cases. Why does this happen? Is it incorrect testing or an internal C1 limitation? There is also a benchmark by garrison with a GTX 1070 and a slower CPU, and it also finished in 55 seconds. That looks strange, because that CPU+GPU combination is genuinely slower than the previous one, yet the results are the same. Can you explain this? It is really very interesting 😊
Can you also clarify the situation with CPU & GPU performance: can one of them bottleneck the other and, if so, how can a balanced configuration be achieved?
Thank you!
Well, SLI doesn't magically give you more power. The point of SLI/Crossfire is to direct the video feed to one port on one card, and thus give all the power to that port.
CO instead simply distributes work to all available cards for OpenCL computation, not for graphical output. For the same reason, there doesn't even need to be a monitor attached to a card for CO to use it.
A bottleneck situation between disk/CPU/GPU is quite complex. It is also dependent upon the megapixel count of the files being processed (given that the disk can read/write fast enough).
Christian Gruner wrote:
garrison wrote:
Christian Gruner
According to one of the export benchmarks made by gnwooding in this thread, with a GTX 1080 Ti there is no performance difference between single-GPU mode and dual-GPU SLI mode - about 55 seconds in both cases. Why does this happen? Is it incorrect testing or an internal C1 limitation? There is also a benchmark by garrison with a GTX 1070 and a slower CPU, and it also finished in 55 seconds. That looks strange, because that CPU+GPU combination is genuinely slower than the previous one, yet the results are the same. Can you explain this? It is really very interesting 😊
Can you also clarify the situation with CPU & GPU performance: can one of them bottleneck the other and, if so, how can a balanced configuration be achieved?
Thank you!
Well, SLI doesn't magically give you more power. The point of SLI/Crossfire is to direct the video feed to one port on one card, and thus give all the power to that port.
CO instead simply distributes work to all available cards for OpenCL computation, not for graphical output. For the same reason, there doesn't even need to be a monitor attached to a card for CO to use it.
Did you read what he wrote? The question isn't in reference to SLI, it's in reference to two cards being installed vs. one, and not seeing the 'linear' difference in performance.
The difference between the benchmark with one 1080 Ti vs. the benchmark with two 1080 Tis is minuscule -- why? Is the CPU bottlenecking the scenario? Do there need to be more cores, or more threads, to supply the GPUs?
The benchmarks he provided also show CPU usage at 50%. No idea what he means by that -- but I'm assuming clock speed isn't a bottleneck, cores are.
Christian Gruner wrote:
A bottleneck situation between disk/CPU/GPU is quite complex. It is also dependent upon the megapixel count of the files being processed (given that the disk can read/write fast enough).
We're pretty clever people, you can talk complex to us. Please be clear so we don't throw money at the wrong solutions.
Just added my data to the score topic. The benchmark score is appalling on my Mac Mini 2012, but CO is still fairly usable (this is my daily computer on which I do all my photography work). Will serve as a reference point if I ever get around to adding an eGPU. 😊
photoGrant wrote:
Christian Gruner wrote:
garrison wrote:
Christian Gruner
According to one of the export benchmarks made by gnwooding in this thread, with a GTX 1080 Ti there is no performance difference between single-GPU mode and dual-GPU SLI mode - about 55 seconds in both cases. Why does this happen? Is it incorrect testing or an internal C1 limitation? There is also a benchmark by garrison with a GTX 1070 and a slower CPU, and it also finished in 55 seconds. That looks strange, because that CPU+GPU combination is genuinely slower than the previous one, yet the results are the same. Can you explain this? It is really very interesting 😊
Can you also clarify the situation with CPU & GPU performance: can one of them bottleneck the other and, if so, how can a balanced configuration be achieved?
Thank you!
Well, SLI doesn't magically give you more power. The point of SLI/Crossfire is to direct the video feed to one port on one card, and thus give all the power to that port.
CO instead simply distributes work to all available cards for OpenCL computation, not for graphical output. For the same reason, there doesn't even need to be a monitor attached to a card for CO to use it.
Did you read what he wrote? The question isn't in reference to SLI, it's in reference to two cards being installed vs. one, and not seeing the 'linear' difference in performance.
The difference between the benchmark with one 1080 Ti vs. the benchmark with two 1080 Tis is minuscule -- why? Is the CPU bottlenecking the scenario? Do there need to be more cores, or more threads, to supply the GPUs?
The benchmarks he provided also show CPU usage at 50%. No idea what he means by that -- but I'm assuming clock speed isn't a bottleneck, cores are.
Christian Gruner wrote:
A bottleneck situation between disk/CPU/GPU is quite complex. It is also dependent upon the megapixel count of the files being processed (given that the disk can read/write fast enough).
We're pretty clever people, you can talk complex to us. Please be clear so we don't throw money at the wrong solutions.
Ah, yep, I misinterpreted his table. However, I think he must have set things up in a non-default way, as I can't reproduce the behavior in-house. Here I still get two unique adapters shown in CO, with the same benchmark as in a non-SLI setup.
Regarding a new setup, the area is too complex for a proper and complete write-up in a forum thread.
In the specific case of megapixel count, the issue is that decompressing a raw file has some overhead, so loading time is not always linear with megapixel count. The raw processing time of the image pipeline, however, is more or less linear with megapixel count.
So, when you are processing a lot of small files, the CPU has to load more files per second than with, say, 100 MP files, and is therefore more heavily loaded in order to keep up with the GPU(s) doing the processing.
As an example, my 7900X with 10 cores and 3 fast AMD GPUs keeps the CPU at almost 100% load when processing 5D Mark III files (100 files in 27 seconds, 100% scale, TIFF).
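To put the loading-overhead point in rough numbers, here is a toy model of a pipelined export where loading and GPU processing overlap; every constant is invented for illustration and none is a measured CO timing:

```python
def per_file_time(megapixels, load_overhead=0.05, load_per_mp=0.002,
                  gpu_per_mp=0.004, gpu_count=1):
    """Toy model: CPU-side loading has a fixed per-file overhead, so it
    is not linear in megapixels; GPU processing roughly is. With the two
    stages overlapped, the slower stage sets the pace."""
    cpu_time = load_overhead + load_per_mp * megapixels
    gpu_time = (gpu_per_mp * megapixels) / gpu_count
    return max(cpu_time, gpu_time)

for mp in (20, 50, 100):
    one = 1 / per_file_time(mp, gpu_count=1)
    three = 1 / per_file_time(mp, gpu_count=3)
    print(f"{mp} MP: {one:.1f} files/s with 1 GPU, {three:.1f} with 3 GPUs")
```

With these made-up numbers, small files are CPU-bound (adding GPUs changes nothing), while large files scale with GPU count until the CPU becomes the limit - the behavior described above.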
Christian Gruner, first of all I would like to thank you for your support, despite this being a user-to-user forum! Your answers help us understand C1 deeply enough to work out which hardware is better suited to an image-processing workflow, which could really help us and other C1 users save some money. I have several kind requests for you that could help clear up some ambiguities:
1. Could you please run this benchmark and tell us the results on your machine (JPEG and also TIFF format would be great)?
2. Can you confirm that when C1 exports raws to JPEG, the JPEG compression stage is a single-CPU-core process that can bottleneck the overall export and prevent the hardware (CPU & GPU) from being used to its full capability? This suggestion was made by StephanR in this thread and confirmed by photoGrant, and by several export benchmarks that produced the same results despite very different CPUs & GPUs (for example, a Core i7 3770 & GTX 1070 vs. a Core i7 5820K & two 1080 Tis in SLI).
3. If we decompose the C1 raw export pipeline into these main stages (I hope I didn't miss anything):
- read the raw file from disk to RAM
- decompress the raw file into C1's internal image format
- apply the image adjustments made by the user
- compress the image to JPEG format
- write the file to disk
Which stages are GPU-accelerated, and which are multithreaded to use multi-core CPUs? Which are single-threaded (and thus potential bottlenecks)?
4. Does the export process run sequentially, file by file, or in parallel for several files simultaneously? If it runs one by one, could it be parallelized once the C1 development team optimizes the algorithms?
5. Which GPU parameters should we pay attention to first when choosing a card for C1:
- CUDA core count
- TMU count
- ROP count
- core clock speed
- memory bandwidth
- memory size
- ... maybe something else?
For example, which is better: a GTX 1050 or a GTX 770 (they are almost the same price where I live)?
Thank you in advance!
@garrison
1: Done and posted.
2: I will have to investigate that further to make sure I am correct.
3: The forum is not the right place for a detailed explanation, but with the exception of the above, all steps are either multithreaded or run in parallel.
4: See above; we load the next raw file on the CPU while the current raw is being rendered by the GPU.
5: Primarily the CUDA cores for Nvidia or the stream processors for AMD. However, the rest of the card's specs also factor into the final verdict, so the benchmark made by CO is indeed a better value for judging relative performance in CO.
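Point 4 describes a classic producer/consumer overlap. Here is a minimal sketch of that pattern; the load/render functions, timings, and file names are stand-ins, not CO's actual code:

```python
import threading
import queue
import time

def load_raw(path):
    """Stand-in for the CPU-side raw decode."""
    time.sleep(0.05)
    return f"decoded:{path}"

def render_on_gpu(image):
    """Stand-in for the GPU-side processing."""
    time.sleep(0.10)
    return f"rendered:{image}"

def export(paths):
    # The loader thread (producer) keeps a small buffer of decoded files
    # ready, so the GPU stage (consumer) rarely waits on disk or CPU.
    buffer = queue.Queue(maxsize=2)

    def loader():
        for path in paths:
            buffer.put(load_raw(path))
        buffer.put(None)  # sentinel: no more files

    threading.Thread(target=loader, daemon=True).start()
    while (image := buffer.get()) is not None:
        print(render_on_gpu(image))

export([f"IMG_{i:04d}.CR2" for i in range(5)])
```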
Christian Gruner wrote:
@garrison
1: Done and posted.
2: I will have to investigate that further to make sure I am correct.
3: The forum is not the right place for a detailed explanation, but with the exception of the above, all steps are either multithreaded or run in parallel.
4: See above; we load the next raw file on the CPU while the current raw is being rendered by the GPU.
5: Primarily the CUDA cores for Nvidia or the stream processors for AMD. However, the rest of the card's specs also factor into the final verdict, so the benchmark made by CO is indeed a better value for judging relative performance in CO.
Thank you, Christian!
We will be eagerly awaiting information from you about the JPEG export.
And another question for you (I hope I haven't bothered you too much yet 😊).
Could you explain why the CPU & GPU are not loaded to roughly 100% even during TIFF export? It can be seen in this graph (made on my PC with the export benchmark):
https://s10.postimg.org/lz5yiaejt/forum4.png
It seems the SSD has performance headroom too, so it cannot be the weak point of the overall process.
If CO could export raws using, for example, two or more parallel pipelines instead of one, would that help optimize the process and load the hardware to its full ability? I guess that even though the stages of a single export pipeline are optimized, there are situations where the hardware sits idle, waiting for another stage to finish. That could explain why the hardware is not fully loaded (I may be wrong, of course).
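The intuition behind that question can be put in rough numbers. This toy calculation uses purely illustrative stage times and treats the single-threaded JPEG encode as the (still unconfirmed) hypothesis from earlier in the thread; it shows how one serial stage caps throughput and how parallel pipelines would lift that cap:

```python
# Toy stage times in seconds per file, invented for illustration only.
gpu_process = 0.20   # parallel stage: shrinks as GPUs are added
jpeg_encode = 0.30   # hypothetically single-threaded: a fixed serial cost

def throughput(parallel_time, serial_time, pipelines=1):
    """Files/second when the serial stage is replicated across
    independent pipelines, as proposed above; within one pipeline
    the overlapping stages run at the pace of the slowest one."""
    per_file = max(parallel_time, serial_time / pipelines)
    return 1 / per_file

print(f"{throughput(gpu_process, jpeg_encode, pipelines=1):.1f} files/s")  # ~3.3
print(f"{throughput(gpu_process, jpeg_encode, pipelines=2):.1f} files/s")  # 5.0
```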