2021-09-09, 20:28  #34
Aug 2002
3^{2}·929 Posts 
We are not sure if this is interesting or not.
13_2_909m1  Near-Cunningham  SNFS(274)
This is a big (33-bit?) job. The msieve.dat file, uncompressed and with duplicates and bad relations removed, is 49GB.
Code:
$ ls -lh
total 105G
-rw-rw-r--. 1 m m  36G Sep  8 20:33 13_2_909m1.dat.gz
drwx------. 2 m m   50 Aug  4 12:17 cub
-r--------. 1 m m  29K Aug  4 12:16 lanczos_kernel.ptx
-r-x------. 1 m m 3.4M Aug  4 12:16 msieve
-rw-rw-r--. 1 m m  49G Sep  8 22:02 msieve.dat
-rw-rw-r--. 1 m m 4.2G Sep  9 14:17 msieve.dat.bak.chk
-rw-rw-r--. 1 m m 4.2G Sep  9 14:54 msieve.dat.chk
-rw-rw-r--. 1 m m 969M Sep  9 12:11 msieve.dat.cyc
-rw-rw-r--. 1 m m  12G Sep  9 12:11 msieve.dat.mat
-rw-rw-r--. 1 m m  415 Sep  2 19:15 msieve.fb
-rw-rw-r--. 1 m m  13K Sep  9 15:10 msieve.log
-r--------. 1 m m 108K Aug  4 12:16 stage1_core.ptx
-rw-rw-r--. 1 m m  264 Sep  2 19:15 worktodo.ini
Code:
commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 27521024 x 27521194 (12901.7 MB) with weight 3687594306 (133.99/col)
sparse part has weight 3106904079 (112.89/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 27520784 x 27521194 (12034.4 MB) with weight 2848207923 (103.49/col)
sparse part has weight 2714419599 (98.63/col)
using GPU 0 (Quadro RTX 8000)
selected card has CUDA arch 7.5
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000013 27520784 9680444
1000000057 27520784 11295968
714419529 27520784 6544782
1039631367 27521194 100000
917599197 27521194 3552480
757189035 27521194 23868304
commencing Lanczos iteration
vector memory use: 5879.2 MB
dense rows memory use: 839.9 MB
sparse matrix memory use: 21339.3 MB
memory use: 28058.3 MB
Allocated 123.0 MB for SpMV library
Allocated 127.8 MB for SpMV library
linear algebra at 0.0%, ETA 49h57m7521194 dimensions (0.0%, ETA 49h57m)
checkpointing every 570000 dimensions
linear algebra completed 925789 of 27521194 dimensions (3.4%, ETA 45h13m)
received signal 2; shutting down
linear algebra completed 926044 of 27521194 dimensions (3.4%, ETA 45h12m)
lanczos halted after 3628 iterations (dim = 926044)
BLanczosTime: 5932
elapsed time 01:38:53
current factorization was interrupted
We have the raw files saved if there are other configurations worth investigating. If so, just let us know!
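The memory lines in the log above can be sanity-checked with a little arithmetic. The sketch below is only an illustration, under two assumptions that the log does not state outright: that "MB" means 2^20 bytes, and that each block vector holds VBITS bits per matrix column. The helper name `vector_mb` is made up for this example.

```python
# Sanity check of the memory figures reported in the log above.
# Assumptions (not stated in the log): "MB" = 2**20 bytes, and one block
# vector stores VBITS bits for each of the matrix columns.
MIB = 2**20


def vector_mb(ncols: int, vbits: int) -> float:
    """Size in MB of one block vector: ncols entries of vbits bits each."""
    return ncols * (vbits // 8) / MIB


ncols, vbits = 27521194, 256
one_vector = vector_mb(ncols, vbits)

# Figures copied from the log.
reported = {"vector": 5879.2, "dense rows": 839.9, "sparse matrix": 21339.3}
implied_vectors = reported["vector"] / one_vector
total = sum(reported.values())

print(f"one vector: {one_vector:.1f} MB")
print(f"vectors implied by the reported vector memory: {implied_vectors:.2f}")
print(f"sum of parts: {total:.1f} MB (log reports 28058.3 MB)")
```

Under those assumptions one vector comes to about 839.9 MB, which matches the "dense rows" line, and the reported vector memory corresponds to roughly seven such vectors. The three parts sum to the reported total within rounding.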
2021-09-09, 21:00  #35
"Curtis"
Feb 2005
Riverside, CA
11×461 Posts 
It's a 32/33 hybrid, with a healthy amount of oversieving (I wanted a matrix below 30M dimensions, success!).
I'm impressed that fits on your card, and 50 hr is pretty amazing. I just started the matrix a few hours ago on a 10-core Ivy Bridge; ETA is 365 hr. If you have the free cycles to run it, please be my guest! The 20+ core-weeks saved are enough to ECM the next candidate.
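The "20+ core weeks" figure follows from the numbers quoted in this post. A minimal sketch of the arithmetic, assuming all 10 cores are busy for the full 365-hour ETA (the function name `core_weeks` is just for illustration):

```python
# Rough check of the "20+ core weeks" figure from the quoted ETAs.
# Assumption: all 10 cores are occupied for the whole 365-hour run.
HOURS_PER_WEEK = 7 * 24


def core_weeks(wall_hours: float, cores: int) -> float:
    """Convert a wall-clock ETA on `cores` cores into core-weeks."""
    return wall_hours * cores / HOURS_PER_WEEK


cpu_core_weeks = core_weeks(365, 10)
wall_clock_ratio = 365 / 50  # CPU ETA versus the roughly 50 h GPU ETA

print(f"CPU cost: {cpu_core_weeks:.1f} core-weeks")
print(f"wall-clock ratio CPU/GPU: {wall_clock_ratio:.1f}x")
```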
2021-09-13, 05:22  #36
Jul 2003
So Cal
2^{4}·139 Posts 
I spent time with Nsight Compute looking at the SpMV kernel. As expected for SpMV, it's memory-bandwidth limited, so increasing occupancy to hide latency should help. I adjusted parameters to reduce both register and shared memory use, which increased the occupancy. This yielded a runtime improvement of only about 5% on the V100, but it may differ on other cards. I also increased the default block_nnz to 1750M to reduce global memory use a bit.
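To get a feel for what a larger block_nnz changes, the sketch below estimates how many blocks the sparse part of the matrix from post #34 would be split into. It is an assumption, not a description of the actual msieve code, that the block count is roughly the nonzero count divided by block_nnz, rounded up. The log in post #34 is at least consistent with that: with 1000000000 nonzeros per block it lists three blocks whose sizes add up to the sparse weight. The name `estimated_blocks` is hypothetical.

```python
import math

# Sparse weight from the log in post #34 ("sparse part has weight ...").
SPARSE_NNZ = 2714419599


def estimated_blocks(nnz: int, block_nnz: int) -> int:
    """Estimate the block count, assuming blocks hold about block_nnz nonzeros.

    Assumption for illustration only: the real partitioning may cut at row
    boundaries, so individual blocks can slightly exceed block_nnz.
    """
    return math.ceil(nnz / block_nnz)


blocks_old = estimated_blocks(SPARSE_NNZ, 1_000_000_000)
blocks_new = estimated_blocks(SPARSE_NNZ, 1_750_000_000)

# The three block sizes printed in the post #34 log add up to the sparse weight.
logged_blocks = [1000000013, 1000000057, 714419529]

print(f"blocks at 1000M: {blocks_old}, blocks at 1750M: {blocks_new}")
print(f"logged blocks sum to sparse weight: {sum(logged_blocks) == SPARSE_NNZ}")
```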

2021-09-16, 06:00  #37
Jul 2003
So Cal
8B0_{16} Posts 
Today I expanded the allowed values of VBITS to any of 64, 128, 192, 256, 320, 384, 448, or 512. This works on both CPUs and GPUs, but I don't expect much, if any, speedup on CPUs. As a GPU benchmark, I tested a 42.1M matrix on two NVLink-connected V100s. Here are the results.
Code:
VBITS  Time (hours)
64     109.5
128    63.75
192    50
256    40.25
320    40.25
384    37.75
448    40.25
512    37.25
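The benchmark is easier to compare when expressed as speedup over VBITS=64. The following sketch only re-derives ratios from the times in the table above; no new measurements are involved.

```python
# Times in hours copied from the benchmark table above (two V100s, 42.1M matrix).
hours = {
    64: 109.5,
    128: 63.75,
    192: 50.0,
    256: 40.25,
    320: 40.25,
    384: 37.75,
    448: 40.25,
    512: 37.25,
}

baseline = hours[64]
speedup = {vbits: baseline / t for vbits, t in hours.items()}
fastest = min(hours, key=hours.get)

for vbits in sorted(hours):
    print(f"VBITS={vbits:3d}  {hours[vbits]:6.2f} h  {speedup[vbits]:.2f}x vs 64")
print(f"fastest setting in this benchmark: VBITS={fastest}")
```

Most of the gain is already there at VBITS=256, and the settings from 256 upward differ by at most a few hours in this run.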
2021-09-17, 11:38  #38
Aug 2002
3^{2}·929 Posts 
Our system has a single GPU. When we are doing compute work on the GPU the display lags. We can think of two ways to fix this.
1. Some sort of niceness assignment to the compute process.
2. Limiting the compute process to less than 100% of the GPU.
Is either of these approaches possible?
2021-09-17, 11:41  #39
Aug 2002
3^{2}×929 Posts 
Since GPU LA is so fast, should we rethink how many relations are generated by the sieving process?

2021-09-17, 12:15  #40
Jun 2003
2×23×113 Posts 

2021-09-17, 14:56  #41
Jul 2003
So Cal
2^{4}×139 Posts 
Quote:
Edit: Lowering VBITS will also reduce kernel runtimes, but don't go below 128. See the benchmark a few posts above. Also, you can't change VBITS in the middle of a run. You would need to start over from the beginning. You can change block_nnz during a restart.
Last fiddled with by frmky on 2021-09-17 at 15:40

2021-09-17, 15:37  #42
Sep 2009
2×7×157 Posts 

2021-09-17, 16:52  #43
"Curtis"
Feb 2005
Riverside, CA
11×461 Posts 
Quote:
Another way to view this is to aim for the number of relations one would use if doing the entire job on one's own equipment, and then add just a bit to reduce the chance of needing to ask the admin for more Q (for example, round Q up to the nearest 5M or 10M increment).
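The rounding rule mentioned here can be written down in a couple of lines. This is only a sketch: the function name `round_up_q` and the example Q value are hypothetical and not taken from any actual job.

```python
def round_up_q(q: int, increment: int) -> int:
    """Round q up to the next multiple of increment (exact multiples stay put)."""
    return -(-q // increment) * increment


# Hypothetical example: a job estimated to need sieving up to Q = 232M.
q_needed = 232_000_000
q_5m = round_up_q(q_needed, 5_000_000)
q_10m = round_up_q(q_needed, 10_000_000)

print(f"rounded to 5M: {q_5m:,}   rounded to 10M: {q_10m:,}")
```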

2021-09-17, 18:56  #44
Aug 2002
3^{2}·929 Posts 
What is the difference in relations needed between TD=120 and TD=100? (Do we have this data?)
We think a GPU could do a TD=100 job faster than a CPU could do a TD=120 job. Personally, we don't mind having to rerun matrix building if there aren't enough relations. We don't know if it is a drag for the admins to add additional relations, but if it isn't a big deal the project could probably run more efficiently. There doesn't seem to be a shortage of LA power, so maybe the project could skew a bit in favor of more jobs overall with fewer relations per job?
Is the bottleneck server storage space? What percentage of CPU-hours is the sieving versus the post-processing work? Does one additional hour of post-processing "save" 1000 hours of sieving? More? Less?
(We lack the technical knowledge and vocabulary to express what we are thinking. Hopefully what we wrote makes a little sense.)