100 views

Skip to first unread message

May 8, 2021, 11:28:29 AMMay 8

to

Hi,

I have a certain interest in a mathematical puzzle that I have not

been able to solve using a normal CPU, and I thought that using

an FPGA could work.

For this, I would like to assign some work packages to search

for certain numbers to the FPGA, which then processes them and

returns the data, plus an indication that it has finished with

that particular package.

The task at hand is extremely parallel, so FPGAs should be a

good match. However, I have zero actual experience with FPGAs,

and I have no idea how to go about assigning the work packages

and getting back the results.

Any pointers? What sort of board should I look for, and how

should I handle the communication?

(For those who are interested: I want to find numbers other than

zero and one for which the sum of digits in all prime bases up

to 17 is the same, the successor to https://oeis.org/A335839 ,

so to speak).

I have a certain interest in a mathematical puzzle that I have not

been able to solve using a normal CPU, and I thought that using

an FPGA could work.

For this, I would like to assign some work packages to search

for certain numbers to the FPGA, which then processes them and

returns the data, plus an indication that it has finished with

that particular package.

The task at hand is extremely parallel, so FPGAs should be a

good match. However, I have zero actual experience with FPGAs,

and I have no idea how to go about assigning the work packages

and getting back the results.

Any pointers? What sort of board should I look for, and how

should I handle the communication?

(For those who are interested: I want to find numbers other than

zero and one for which the sum of digits in all prime bases up

to 17 is the same, the successor to https://oeis.org/A335839 ,

so to speak).

May 8, 2021, 1:02:19 PMMay 8

to

On 08/05/2021 16:28, Thomas Koenig wrote:

> Hi,

>

> I have a certain interest in a mathematical puzzle that I have not

> been able to solve using a normal CPU, and I thought that using

> an FPGA could work.

>

...
> Hi,

>

> I have a certain interest in a mathematical puzzle that I have not

> been able to solve using a normal CPU, and I thought that using

> an FPGA could work.

>

I think you are moving in the wrong direction, if you can't solve it

with some numerical package like numpy/linpack then it is highly

unlikely you will succeed with an FPGA based solution.

What you probably want is a fast graphics card + CUDA/OpenCL which will

most likely outperform your FPGA based design.

Still it will be an interesting learning exercise ;-)

Hans

www.ht-lab.com

May 9, 2021, 4:05:27 AMMay 9

to

On 5/8/21 5:28 PM, Thomas Koenig wrote:

> I have a certain interest in a mathematical puzzle that I have not

> been able to solve using a normal CPU, and I thought that using

> an FPGA could work.

Out of curiosity, what is the specific issue you encounter using a
> I have a certain interest in a mathematical puzzle that I have not

> been able to solve using a normal CPU, and I thought that using

> an FPGA could work.

'normal' CPU ?

As you say:

> The task at hand is extremely parallel,

atoms to process over p cpu, the cpu would take a time

t_cpu = ta_cpu * round-up(n/p_cpu)

The only way a FPGA can beat that is if it:

a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu

b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu

c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

Depending on how much you're willing to spend (big FPGAs aren't cheap),

the first question would be, how big can you get 'p_cpu' ? Using MPI to

distribute the atoms of work over a lot of cores should not be very

difficult, and a 'lot of cores' can be obtained easily from cloud

providers nowadays.

FPGAs are not as easy to tryout, today I think it's pretty much Amazon

F1 in the cloud - or buying.

That being said, FPGA vendors promote a lot of solutions for this

particular problem, from low-level solutions (e.g. a PCIe core and a lot

of hand-written Verilog/VHDL/...) to high-level solutions (e.g.

<https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,

<https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,

etc.). Those solutions can be with stand-alone FPGAs or with the FPGA

integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among

others).

There's also non-vendor solutions, mostly accelerated SoC such as

<https://www.esp.cs.columbia.edu/> or

<https://github.com/google/CFU-Playground> (extension to

<https://github.com/enjoy-digital/litex>) that can help get started.

Cordially,

--

Romain

May 9, 2021, 4:22:56 AMMay 9

to

HT-Lab <han...@htminuslab.com> schrieb:

> then it is highly

> unlikely you will succeed with an FPGA based solution.

An FPGA would be quite good, IMHO.

What I would need are things like

- an efficient (base 2) popcount operation

- counters in base 3, 5, 7,11, 13 and 17

- adders for all of the bases above

- efficient popcount operations for all of the bases

above

plus handling of numbers in the region of 72 bits.

> What you probably want is a fast graphics card + CUDA/OpenCL which will

> most likely outperform your FPGA based design.

That is an alternative. I am also looking at that, but

FPGAs seem to be more interesting, at the moment.

> Still it will be an interesting learning exercise ;-)

Certainly.

Therefore, what sort of system should I be looking for? I don't

want to spend my whole time writing Linux kernel drivers or

Bluetooth communication drivers for the FPGA :-)

So, something that can be interfaced easily with a computer

(either on board or with a host computer running Linux) would

be great.

> On 08/05/2021 16:28, Thomas Koenig wrote:

>> Hi,

>>

>> I have a certain interest in a mathematical puzzle that I have not

>> been able to solve using a normal CPU, and I thought that using

>> an FPGA could work.

>>

> ...

>

> I think you are moving in the wrong direction, if you can't solve it

> with some numerical package like numpy/linpack

Definitely not the right kind of problem.
>> Hi,

>>

>> I have a certain interest in a mathematical puzzle that I have not

>> been able to solve using a normal CPU, and I thought that using

>> an FPGA could work.

>>

> ...

>

> I think you are moving in the wrong direction, if you can't solve it

> with some numerical package like numpy/linpack

> then it is highly

> unlikely you will succeed with an FPGA based solution.

What I would need are things like

- an efficient (base 2) popcount operation

- counters in base 3, 5, 7,11, 13 and 17

- adders for all of the bases above

- efficient popcount operations for all of the bases

above

plus handling of numbers in the region of 72 bits.

> What you probably want is a fast graphics card + CUDA/OpenCL which will

> most likely outperform your FPGA based design.

FPGAs seem to be more interesting, at the moment.

> Still it will be an interesting learning exercise ;-)

Therefore, what sort of system should I be looking for? I don't

want to spend my whole time writing Linux kernel drivers or

Bluetooth communication drivers for the FPGA :-)

So, something that can be interfaced easily with a computer

(either on board or with a host computer running Linux) would

be great.

May 9, 2021, 7:50:48 AMMay 9

to

Romain Dolbeau <rom...@dolbeau.org> schrieb:

I managed to search the number space up to around 2^64 in around

half a CPU year (from which you can tell that one key is to

reduce the search space).

>> The task at hand is extremely parallel,

>

> Typically, assuming a constant-time (of duration ta) atom of work and n

> atoms to process over p cpu, the cpu would take a time

> t_cpu = ta_cpu * round-up(n/p_cpu)

>

> The only way a FPGA can beat that is if it:

> a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu

> b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu

> c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

There are things that an FPGA should be able to do better than

a CPU. One example is implementing a base-n counter, which is a

serial operation on a CPU and can easily be done in parallel on

an FPGA.

> Depending on how much you're willing to spend (big FPGAs aren't cheap),

> the first question would be, how big can you get 'p_cpu' ? Using MPI to

> distribute the atoms of work over a lot of cores should not be very

> difficult, and a 'lot of cores' can be obtained easily from cloud

> providers nowadays.

That is of course a possibility. In the CPU-based approach I

simply used OpenMP with schedule(dynamic). However, for this

kind of hobbyist thing, I'd rather learn something interesting

than throw money at a cloud provider :-)

> That being said, FPGA vendors promote a lot of solutions for this

> particular problem, from low-level solutions (e.g. a PCIe core and a lot

> of hand-written Verilog/VHDL/...) to high-level solutions (e.g.

><https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,

><https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,

> etc.). Those solutions can be with stand-alone FPGAs or with the FPGA

> integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among

> others).

> There's also non-vendor solutions, mostly accelerated SoC such as

><https://www.esp.cs.columbia.edu/> or

><https://github.com/google/CFU-Playground> (extension to

><https://github.com/enjoy-digital/litex>) that can help get started.

Thanks for the pointers.

Seems to be rather high-level, and also rather abstract (ok, so these

systems are usually aimed at professionals, not at hobbyists).

I'll look around a bit and see if I can find anything that helps me,

but at the moment, I have to say it all looks rather daunting :-)

> On 5/8/21 5:28 PM, Thomas Koenig wrote:

>> I have a certain interest in a mathematical puzzle that I have not

>> been able to solve using a normal CPU, and I thought that using

>> an FPGA could work.

>

> Out of curiosity, what is the specific issue you encounter using a

> 'normal' CPU ?

It's too slow.
>> I have a certain interest in a mathematical puzzle that I have not

>> been able to solve using a normal CPU, and I thought that using

>> an FPGA could work.

>

> Out of curiosity, what is the specific issue you encounter using a

> 'normal' CPU ?

I managed to search the number space up to around 2^64 in around

half a CPU year (from which you can tell that one key is to

reduce the search space).

>> The task at hand is extremely parallel,

>

> Typically, assuming a constant-time (of duration ta) atom of work and n

> atoms to process over p cpu, the cpu would take a time

> t_cpu = ta_cpu * round-up(n/p_cpu)

>

> The only way a FPGA can beat that is if it:

> a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu

> b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu

> c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

a CPU. One example is implementing a base-n counter, which is a

serial operation on a CPU and can easily be done in parallel on

an FPGA.

> Depending on how much you're willing to spend (big FPGAs aren't cheap),

> the first question would be, how big can you get 'p_cpu' ? Using MPI to

> distribute the atoms of work over a lot of cores should not be very

> difficult, and a 'lot of cores' can be obtained easily from cloud

> providers nowadays.

simply used OpenMP with schedule(dynamic). However, for this

kind of hobbyist thing, I'd rather learn something interesting

than throw money at a cloud provider :-)

> That being said, FPGA vendors promote a lot of solutions for this

> particular problem, from low-level solutions (e.g. a PCIe core and a lot

> of hand-written Verilog/VHDL/...) to high-level solutions (e.g.

><https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,

><https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,

> etc.). Those solutions can be with stand-alone FPGAs or with the FPGA

> integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among

> others).

> There's also non-vendor solutions, mostly accelerated SoC such as

><https://www.esp.cs.columbia.edu/> or

><https://github.com/google/CFU-Playground> (extension to

><https://github.com/enjoy-digital/litex>) that can help get started.

Seems to be rather high-level, and also rather abstract (ok, so these

systems are usually aimed at professionals, not at hobbyists).

I'll look around a bit and see if I can find anything that helps me,

but at the moment, I have to say it all looks rather daunting :-)

May 9, 2021, 8:31:27 AMMay 9

to

On 5/9/21 1:50 PM, Thomas Koenig wrote:

> That is of course a possibility. In the CPU-based approach I

> simply used OpenMP with schedule(dynamic). However, for this

> kind of hobbyist thing, I'd rather learn something interesting

> than throw money at a cloud provider :-)

Agreed. But distributed computing can do wonder for many brute-force
> That is of course a possibility. In the CPU-based approach I

> simply used OpenMP with schedule(dynamic). However, for this

> kind of hobbyist thing, I'd rather learn something interesting

> than throw money at a cloud provider :-)

mathematical problems (e.g.

<https://en.wikipedia.org/wiki/Lychrel_number#196_palindrome_quest> ;-) ).

> Seems to be rather high-level, and also rather abstract (ok, so these

> systems are usually aimed at professionals, not at hobbyists).

> I'll look around a bit and see if I can find anything that helps me,

> but at the moment, I have to say it all looks rather daunting :-)

(a) implementing the hardware operator(s) in a fast/efficient way;

(b) integrating said operator(s) in an acceleration framework.

Tackling both at once can be daunting. However, tackling only the first

(which is likely the most interesting one and sort-of-research) should

be quite achievable. Once that works, you can figure out how to put them

in 'production' by having the right set up (number of operators, speed,

memory requirements, etc.).

E.g., if you can express the problem as a (set of) 32x32 -> 32 operators

(with some ad-hoc encoding, presumably), you can easily add dedicated

instructions in a 32-bits softcore to evaluate the performance benefit.

Then you can figure out a way to high-performance. Some softcore might

let you do wider operands/results - 64-bits should not be much of a

problem with 64-bit softcores.

Blowing my own horn here (sorry), but for instance you can have a look

at: <https://github.com/rdolbeau/VexRiscvBPluginGenerator/> which is

designed to easily add integer-pipeline instructions to a VexRiscv

(RV32) core in a Linux-capable Litex SoC. I run a 100 MHz quad-cores on

a ~$90 board; it won't be faster than a beefy CPU, but it's a cheap way

to start evaluating implementations. There's also support for 32x32->64

instructions (from the draft P 'packed simd' extension to RISC-V [2]),

32x32x32->32 instructions (from the draft Zbt 'bitmanip ternary'

extension [1]) and you can even do 32x32x32->64 if you really want (by

abusing both systems, for instance to implement a faster Chacha [3], see

e.g.

<https://github.com/rdolbeau/VexRiscvBPluginGenerator/blob/master/data_Chacha64.txt>).

That would the easiest way to prototype some operators, I believe.

Alternatively, VexRiscv has a FPU 'coprocessor' that can be an

inspiration to implement a dedicated unit with data width up to 64 bits

(but it will be more complicated).

Finally you can go for the full-custom acceleration peripheral; the

CFU-playgrounds is one way, or you can look at the ESP project as they

are more focused on acceleration. But that's basically solving (b) along

with (a).

Cordially,

Romain

[1] <https://github.com/riscv/riscv-bitmanip>

[2] <https://github.com/riscv/riscv-p-spec>

[3] <https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant>

--

Romain

May 9, 2021, 11:30:41 AMMay 9

to

On 09/05/2021 09:22, Thomas Koenig wrote:

> HT-Lab <han...@htminuslab.com> schrieb:

>> On 08/05/2021 16:28, Thomas Koenig wrote:

>>> Hi,

>>>

>>> I have a certain interest in a mathematical puzzle that I have not

>>> been able to solve using a normal CPU, and I thought that using

>>> an FPGA could work.

>>>

>> ...

>>

>> I think you are moving in the wrong direction, if you can't solve it

>> with some numerical package like numpy/linpack

>

> Definitely not the right kind of problem.

OK, I must admit I didn't really look closely at the page you gave but I
> HT-Lab <han...@htminuslab.com> schrieb:

>> On 08/05/2021 16:28, Thomas Koenig wrote:

>>> Hi,

>>>

>>> I have a certain interest in a mathematical puzzle that I have not

>>> been able to solve using a normal CPU, and I thought that using

>>> an FPGA could work.

>>>

>> ...

>>

>> I think you are moving in the wrong direction, if you can't solve it

>> with some numerical package like numpy/linpack

>

> Definitely not the right kind of problem.

do know for a lot of numerical intensive calculations a modern PC + Cuda

is not easily beaten by an FPGA especially in terms of cost and

development time.

>

>> then it is highly

>> unlikely you will succeed with an FPGA based solution.

>

> An FPGA would be quite good, IMHO.

>

> What I would need are things like

>

> - an efficient (base 2) popcount operation

This is easy as most processors have a POPCNT instruction so you should
>> then it is highly

>> unlikely you will succeed with an FPGA based solution.

>

> An FPGA would be quite good, IMHO.

>

> What I would need are things like

>

> - an efficient (base 2) popcount operation

be able to find some efficient RTL code on the web. In most cases it is

just a bunch of counters/adders.

>

> - counters in base 3, 5, 7,11, 13 and 17

with large word length, if not then a LUTs+adders could provide a fast

solution.

>

> - adders for all of the bases above

optimised vendors cores), do all your operations and move back to base

3..17?

>

> - efficient popcount operations for all of the bases

> above

>

> plus handling of numbers in the region of 72 bits.

That could be a problem as 72bits adders/popcnt will not be fast, you
> - efficient popcount operations for all of the bases

> above

>

> plus handling of numbers in the region of 72 bits.

will need to heavily pipeline and optimise your design which adds

another level of complexity.

>

>> What you probably want is a fast graphics card + CUDA/OpenCL which will

>> most likely outperform your FPGA based design.

>

> That is an alternative. I am also looking at that, but

> FPGAs seem to be more interesting, at the moment.

>

>> Still it will be an interesting learning exercise ;-)

>

> Certainly.

>

> Therefore, what sort of system should I be looking for? I don't

> want to spend my whole time writing Linux kernel drivers or

> Bluetooth communication drivers for the FPGA :-)

high. In this case I would go for a simple UART, you can easily get

1Mbits without much effort. No special drivers are required.

If you need more bandwidth then have a look at the many Future

Technology USB devices like the F232H which are easy to interface and

could give you up to 40MByte/sec transfer speeds. The drivers are freely

available for Windows and Linux. I have used them on a previous project

and they worked without any issue.

For anything higher get a PCIe FPGA development board which normally

come with drivers to fast DMA a block of data to and from the FPGA.

Good luck,

Hans

www.ht-lab.com

May 9, 2021, 11:46:26 AMMay 9

to

On 09/05/2021 12:50, Thomas Koenig wrote:

..snip

FPGA's are the best solution :-)

> I'd rather learn something interesting

> than throw money at a cloud provider :-)

>

>> That being said, FPGA vendors promote a lot of solutions for this

>> particular problem, from low-level solutions (e.g. a PCIe core and a lot

>> of hand-written Verilog/VHDL/...) to high-level solutions (e.g.

>> <https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,

>> <https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,

>> etc.). Those solutions can be with stand-alone FPGAs or with the FPGA

>> integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among

>> others).

>

>> There's also non-vendor solutions, mostly accelerated SoC such as

>> <https://www.esp.cs.columbia.edu/> or

>> <https://github.com/google/CFU-Playground> (extension to

>> <https://github.com/enjoy-digital/litex>) that can help get started.

>

> Thanks for the pointers.

>

> Seems to be rather high-level, and also rather abstract (ok, so these

> systems are usually aimed at professionals, not at hobbyists).

>

> I'll look around a bit and see if I can find anything that helps me,

> but at the moment, I have to say it all looks rather daunting :-)

Just start small, take one of your required operators, say popcnt,

implement it in VHDL/(S)Verilog (or chisel/Python/C/etc) and simulate

it. Next get a low cost board from eBay, download the free vendor tools

and try to implement it. Depending on the prototype board you can

probably use some switches and 7-segment display for I/O.

Good luck,

Regards,

Hans.

www.ht-lab.com

..snip

>

> That is of course a possibility. In the CPU-based approach I

> simply used OpenMP with schedule(dynamic). However, for this

> kind of hobbyist thing,

Ah, I assumed this was some commercial project, in that case go for it,
> That is of course a possibility. In the CPU-based approach I

> simply used OpenMP with schedule(dynamic). However, for this

> kind of hobbyist thing,

FPGA's are the best solution :-)

> I'd rather learn something interesting

> than throw money at a cloud provider :-)

>

>> That being said, FPGA vendors promote a lot of solutions for this

>> particular problem, from low-level solutions (e.g. a PCIe core and a lot

>> of hand-written Verilog/VHDL/...) to high-level solutions (e.g.

>> <https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,

>> <https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,

>> etc.). Those solutions can be with stand-alone FPGAs or with the FPGA

>> integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among

>> others).

>

>> There's also non-vendor solutions, mostly accelerated SoC such as

>> <https://www.esp.cs.columbia.edu/> or

>> <https://github.com/google/CFU-Playground> (extension to

>> <https://github.com/enjoy-digital/litex>) that can help get started.

>

> Thanks for the pointers.

>

> Seems to be rather high-level, and also rather abstract (ok, so these

> systems are usually aimed at professionals, not at hobbyists).

>

> I'll look around a bit and see if I can find anything that helps me,

> but at the moment, I have to say it all looks rather daunting :-)

implement it in VHDL/(S)Verilog (or chisel/Python/C/etc) and simulate

it. Next get a low cost board from eBay, download the free vendor tools

and try to implement it. Depending on the prototype board you can

probably use some switches and 7-segment display for I/O.

Good luck,

Regards,

Hans.

www.ht-lab.com

May 9, 2021, 1:53:54 PMMay 9

to

levels of processing. It might make sense to look for one with a PCIe

connector that can be just connected to a PC to be a bit easier to

interface, but even a stand alone board, maybe with small embedded

processor that just sends answers out the serial port may be simpler.

For ideas of how to build the computation. Thinking a bit, the idea that

module-N counters are fairly simple it a good starting point. You

actually don't want a 'simple' counter as that says you can't get the

parrallism, But building N count by N counters sets (of 7 base-x

counters, 2, 3, 5, 7, 11, 13, 17).

Such a counter probably costs 2-3 Luts per bit per base, At your

approximately 72 bits numbers, we are talking about 2k luts per

computation unit.

For the biggest devices, we could maybe get 1000 of these into a very

largest FPGA, and likely could be processing at a few 100 MHz clock

rate, so you will be works at a total processing rate in the 100s of

Billion tests per second, which should allow you to make a rough

estimate of the speed it will process. You may not want to plan on the

very largest of FPGAs, as those ARE pricey (the board for the one I

looked up for this size was about $16,000).

May 9, 2021, 2:21:30 PMMay 9

to

Hi Thomas.

If I understood you correctly, what you want would be an FPGA engine/coprocessor that you make the equivalent calculations of the naïve C code that I have below. That is a pretty neat mathematical problem!

I hope that you know a more efficient way of converting any number to a sequence of digits of a given base than the one I have written.

The convertBase() algorithm that I wrote is not exactly FPGA friendly, but it can be managed to put in a FPGA in a efficient way with Dividers and Multipliers in pipeline maybe...

My advice is to find some metrics that you want for a first smallish FPGA engine/coprocessor (like process 10M numbers per second, using up 2000LUTS, 500FFs, 2BRAMs, 4 mults 18x18). Any FPGA board should be good to start this project, but for a beginner it is better to use some streamline board. Then, it is a matter of replicating that FPGA engine/coprocessor and how much money you can afford in buying a board with big FPGA device or some cloud time in some FPGA cloud server. And it is possible that something that you can put to work at 100MHz in a cheap FPGA board, may run at 400MHz in a very expensive one...

For "convertBase(m, 2); sum1 = SumArray();" you can use a pipelined 'popcount' arquitecture, for the other cases you may use pipelined tree adders (with a small numbers of bits this will be really fast). With pipeline, you can execute the section "SumArray();" as if it was being execute it in just one clock cycle at 100MHz or 200MHz or even more!

The not so FPGA friendly part is really the "convertBase()" algorithm. That loop with a division (and a multiplication) is a bit troublesome... I hope you know better algorithm to perform this part. I can think in ways of using pipelined dividers... but most likely it is not the most efficient way...

Regards,

Nelson

#include <stdio.h>

#include <stdint.h>

// Definition of Constants

#define C_VALUELIMIT_INIT 2000000000

#define C_VALUELIMIT_FINIT 2010000000

#define C_BASECONVEND 0xFF

#define C_DIGITMAXSIZE 256

// Definition of Global Variables

uint8_t conv[C_DIGITMAXSIZE];

// Definition of Functions

void convertBase(uint64_t n, uint8_t k) {

uint64_t l, j;

int i = 0;

if (n == 0) conv[i++] = '0';

while (n > 0) {

l = n / k;

j = n - k * l;

conv[i] = (uint32_t) j;

n = l;

i++;

}

conv[i] = C_BASECONVEND;

}

uint32_t SumArray() {

uint32_t sum = 0; int i = 0;

while (conv[i] != C_BASECONVEND) i++; i--;

for (; i >= 0; i--) sum += conv[i];

return sum;

}

int main() {

uint32_t sum1, sum2;

uint64_t m;

for (m = C_VALUELIMIT_INIT; m < (uint64_t) C_VALUELIMIT_FINIT; m++) {

convertBase(m, 2); sum1 = SumArray();

convertBase(m, 3); sum2 = SumArray(); if (sum1 != sum2) continue;

convertBase(m, 5); sum2 = SumArray(); if (sum1 != sum2) continue;

convertBase(m, 7); sum2 = SumArray(); if (sum1 != sum2) continue;

convertBase(m, 11); sum2 = SumArray(); if (sum1 != sum2) continue;

convertBase(m, 13); sum2 = SumArray(); if (sum1 != sum2) continue;

printf("Sequence number found %lld\n", m);

}

return 0;

}

If I understood you correctly, what you want would be an FPGA engine/coprocessor that you make the equivalent calculations of the naïve C code that I have below. That is a pretty neat mathematical problem!

I hope that you know a more efficient way of converting any number to a sequence of digits of a given base than the one I have written.

The convertBase() algorithm that I wrote is not exactly FPGA friendly, but it can be managed to put in a FPGA in a efficient way with Dividers and Multipliers in pipeline maybe...

My advice is to find some metrics that you want for a first smallish FPGA engine/coprocessor (like process 10M numbers per second, using up 2000LUTS, 500FFs, 2BRAMs, 4 mults 18x18). Any FPGA board should be good to start this project, but for a beginner it is better to use some streamline board. Then, it is a matter of replicating that FPGA engine/coprocessor and how much money you can afford in buying a board with big FPGA device or some cloud time in some FPGA cloud server. And it is possible that something that you can put to work at 100MHz in a cheap FPGA board, may run at 400MHz in a very expensive one...

For "convertBase(m, 2); sum1 = SumArray();" you can use a pipelined 'popcount' arquitecture, for the other cases you may use pipelined tree adders (with a small numbers of bits this will be really fast). With pipeline, you can execute the section "SumArray();" as if it was being execute it in just one clock cycle at 100MHz or 200MHz or even more!

The not so FPGA friendly part is really the "convertBase()" algorithm. That loop with a division (and a multiplication) is a bit troublesome... I hope you know better algorithm to perform this part. I can think in ways of using pipelined dividers... but most likely it is not the most efficient way...

Regards,

Nelson

#include <stdio.h>

#include <stdint.h>

// Definition of Constants

#define C_VALUELIMIT_INIT 2000000000

#define C_VALUELIMIT_FINIT 2010000000

#define C_BASECONVEND 0xFF

#define C_DIGITMAXSIZE 256

// Definition of Global Variables

uint8_t conv[C_DIGITMAXSIZE];

// Definition of Functions

void convertBase(uint64_t n, uint8_t k) {

uint64_t l, j;

int i = 0;

if (n == 0) conv[i++] = '0';

while (n > 0) {

l = n / k;

j = n - k * l;

conv[i] = (uint32_t) j;

n = l;

i++;

}

conv[i] = C_BASECONVEND;

}

uint32_t SumArray() {

uint32_t sum = 0; int i = 0;

while (conv[i] != C_BASECONVEND) i++; i--;

for (; i >= 0; i--) sum += conv[i];

return sum;

}

int main() {

uint32_t sum1, sum2;

uint64_t m;

for (m = C_VALUELIMIT_INIT; m < (uint64_t) C_VALUELIMIT_FINIT; m++) {

convertBase(m, 2); sum1 = SumArray();

convertBase(m, 3); sum2 = SumArray(); if (sum1 != sum2) continue;

convertBase(m, 5); sum2 = SumArray(); if (sum1 != sum2) continue;

convertBase(m, 7); sum2 = SumArray(); if (sum1 != sum2) continue;

convertBase(m, 11); sum2 = SumArray(); if (sum1 != sum2) continue;

convertBase(m, 13); sum2 = SumArray(); if (sum1 != sum2) continue;

printf("Sequence number found %lld\n", m);

}

return 0;

}

May 9, 2021, 2:59:51 PMMay 9

to

On 5/9/21 2:21 PM, Nelson Ribeiro wrote:

> Hi Thomas.

>

> If I understood you correctly, what you want would be an FPGA engine/coprocessor that you make the equivalent calculations of the naïve C code that I have below. That is a pretty neat mathematical problem!

> I hope that you know a more efficient way of converting any number to a sequence of digits of a given base than the one I have written.

> The convertBase() algorithm that I wrote is not exactly FPGA friendly, but it can be managed to put in a FPGA in a efficient way with Dividers and Multipliers in pipeline maybe...

>

> My advice is to find some metrics that you want for a first smallish FPGA engine/coprocessor (like process 10M numbers per second, using up 2000LUTS, 500FFs, 2BRAMs, 4 mults 18x18). Any FPGA board should be good to start this project, but for a beginner it is better to use some streamline board. Then, it is a matter of replicating that FPGA engine/coprocessor and how much money you can afford in buying a board with big FPGA device or some cloud time in some FPGA cloud server. And it is possible that something that you can put to work at 100MHz in a cheap FPGA board, may run at 400MHz in a very expensive one...

>

> For "convertBase(m, 2); sum1 = SumArray();" you can use a pipelined 'popcount' arquitecture, for the other cases you may use pipelined tree adders (with a small numbers of bits this will be really fast). With pipeline, you can execute the section "SumArray();" as if it was being execute it in just one clock cycle at 100MHz or 200MHz or even more!

>

> The not so FPGA friendly part is really the "convertBase()" algorithm. That loop with a division (and a multiplication) is a bit troublesome... I hope you know better algorithm to perform this part. I can think in ways of using pipelined dividers... but most likely it is not the most efficient way...

>

> Regards,

> Nelson

>

>

I would NOT do a convertBase() type archtecture for the FPGA. It is just
> Hi Thomas.

>

> If I understood you correctly, what you want would be an FPGA engine/coprocessor that you make the equivalent calculations of the naïve C code that I have below. That is a pretty neat mathematical problem!

> I hope that you know a more efficient way of converting any number to a sequence of digits of a given base than the one I have written.

> The convertBase() algorithm that I wrote is not exactly FPGA friendly, but it can be managed to put in a FPGA in a efficient way with Dividers and Multipliers in pipeline maybe...

>

> My advice is to find some metrics that you want for a first smallish FPGA engine/coprocessor (like process 10M numbers per second, using up 2000LUTS, 500FFs, 2BRAMs, 4 mults 18x18). Any FPGA board should be good to start this project, but for a beginner it is better to use some streamline board. Then, it is a matter of replicating that FPGA engine/coprocessor and how much money you can afford in buying a board with big FPGA device or some cloud time in some FPGA cloud server. And it is possible that something that you can put to work at 100MHz in a cheap FPGA board, may run at 400MHz in a very expensive one...

>

> For "convertBase(m, 2); sum1 = SumArray();" you can use a pipelined 'popcount' arquitecture, for the other cases you may use pipelined tree adders (with a small numbers of bits this will be really fast). With pipeline, you can execute the section "SumArray();" as if it was being execute it in just one clock cycle at 100MHz or 200MHz or even more!

>

> The not so FPGA friendly part is really the "convertBase()" algorithm. That loop with a division (and a multiplication) is a bit troublesome... I hope you know better algorithm to perform this part. I can think in ways of using pipelined dividers... but most likely it is not the most efficient way...

>

> Regards,

> Nelson

>

>

too unfriendly.

My thought was to build a series of 'Base-X' counters/accumulators, in

the bases, 2, 3, 5, 7, 11, 13, 17. This is a fairly simple operation,

especially since the increment value will be a constant equal to the

number copies of the system. Start with them all at the same value (like

0) and just increment them by the same value expressed in their base.

This becomes an easy one cycle to update system.

May 9, 2021, 4:21:04 PMMay 9

to

Nelson Ribeiro <ngrr.r...@gmail.com> schrieb:

> If I understood you correctly, what you want would be an FPGA

> engine/coprocessor that you make the equivalent calculations

> of the naïve C code that I have below. That is a pretty neat

> mathematical problem!

Yep, it's neat. What I did worked for all primes up to 13,

but 17 is just too far off (so far).

> I hope that you know a more efficient way of converting any number

> to a sequence of digits of a given base than the one I have written.

In the immortal words of Henry S. Warren of "Hacker's Delight"

fame: "On many computers, division is very time consuming and is

to be avoided when possible."

He also gives a neat bag of tricks of calculating the division

remainder of many odd constants, by selectively summing their

digits. This works for numbers n where

2 ^ m = 1 (mod n)

so you can break your number into chunks of m bits, add them

together and still have the same remainder.

Once you have calculated the remainder by repeated addition of

these chunks to a size you can manage, you can then divide by

multiplying with the modular inverse of its number.

This will give you a single digit of your base n number, to

be repeated until the number has been converted to base n.

For base three, 4 = 3+1, so any grouping of bits with

an even number works.

I understand most FPGAs have six-bit lookup tables these days.

For calculating the remainder base three, that is actually

pretty handy - use 12*2 LUTs to reduce the bits from 72

to 24 in one go. Repeat, and you are left with 8 bits,

which is definitely managable.

Of course, then comes the 72*72 bit multiplication, which is

probably going to take some time...

Base 11 and 13 are less friendly, they would need 10 respectively

12 bit lookup tables.

> For "convertBase(m, 2); sum1 = SumArray();" you can use a

> pipelined 'popcount' arquitecture,

That is one thing I already looked at. There is a rather

elegant popcnt implementation using a 6-bit counter.

> If I understood you correctly, what you want would be an FPGA

> engine/coprocessor that you make the equivalent calculations

> of the naïve C code that I have below. That is a pretty neat

> mathematical problem!

but 17 is just too far off (so far).

> I hope that you know a more efficient way of converting any number

> to a sequence of digits of a given base than the one I have written.

fame: "On many computers, division is very time consuming and is

to be avoided when possible."

He also gives a neat bag of tricks of calculating the division

remainder of many odd constants, by selectively summing their

digits. This works for numbers n where

2 ^ m = 1 (mod n)

so you can break your number into chunks of m bits, add them

together and still have the same remainder.

Once you have calculated the remainder by repeated addition of

these chunks to a size you can manage, you can then divide by

multiplying with the modular inverse of its number.

This will give you a single digit of your base n number, to

be repeated until the number has been converted to base n.

For base three, 4 = 3+1, so any grouping of bits with

an even number works.

I understand most FPGAs have six-bit lookup tables these days.

For calculating the remainder base three, that is actually

pretty handy - use 12*2 LUTs to reduce the bits from 72

to 24 in one go. Repeat, and you are left with 8 bits,

which is definitely managable.

Of course, then comes the 72*72 bit multiplication, which is

probably going to take some time...

Base 11 and 13 are less friendly, they would need 10 respectively

12 bit lookup tables.

> For "convertBase(m, 2); sum1 = SumArray();" you can use a

> pipelined 'popcount' arquitecture,

elegant popcnt implementation using a 6-bit counter.

May 9, 2021, 4:44:40 PMMay 9

to

Richard Damon <Ric...@Damon-Family.org> schrieb:

> My thought was to build a series of 'Base-X' counters/accumulators, in

> the bases, 2, 3, 5, 7, 11, 13, 17. This is a fairly simple operation,

> especially since the increment value will be a constant equal to the

> number copies of the system. Start with them all at the same value (like

> 0) and just increment them by the same value expressed in their base.

That sounds like a good possibility.

There is one important thing: It is possible to reduce the

amount of work done rather dramatically, and this is also

necessary.

Going to 2^64 with this problem (which I have already done) means

looking at around 1.84e19 numbers. Running at 500e6 Hz and doing

one test per cycle would lead to 3.7e10 seconds running time,

or about 1170 years. Too long.

The serial version of the code consists of nested loops running from

0 to 16. The sum of digits reached so far is easy to calculate,

just add the sum of the digits to the new one. The minimum number

of digits base 17 then is that sum.

It is then possible to calculate the maximum number of bits that the

binary representation in that range can have, and skip the loop

if that is too large.

Example: If the sum of the first five leading digits base 17 is 85,

there is no way that we will find a number with 72 bits whose

popcount is equal to 85.

That has saved a _lot_ of computing effort, at the cost of adding

some complexity to the program.

So, any counter will have to have some rather complicated logic

to make it skip the values where there cannot possibly be a match.

> My thought was to build a series of 'Base-X' counters/accumulators, in

> the bases, 2, 3, 5, 7, 11, 13, 17. This is a fairly simple operation,

> especially since the increment value will be a constant equal to the

> number copies of the system. Start with them all at the same value (like

> 0) and just increment them by the same value expressed in their base.

There is one important thing: It is possible to reduce the

amount of work done rather dramatically, and this is also

necessary.

Going to 2^64 with this problem (which I have already done) means

looking at around 1.84e19 numbers. Running at 500e6 Hz and doing

one test per cycle would lead to 3.7e10 seconds running time,

or about 1170 years. Too long.

The serial version of the code consists of nested loops running from

0 to 16. The sum of digits reached so far is easy to calculate,

just add the sum of the digits to the new one. The minimum number

of digits base 17 then is that sum.

It is then possible to calculate the maximum number of bits that the

binary representation in that range can have, and skip the loop

if that is too large.

Example: If the sum of the first five leading digits base 17 is 85,

there is no way that we will find a number with 72 bits whose

popcount is equal to 85.

That has saved a _lot_ of computing effort, at the cost of adding

some complexity to the program.

So, any counter will have to have some rather complicated logic

to make it skip the values where there cannot possibly be a match.

May 9, 2021, 4:54:35 PMMay 9

to

try to convert an 'aarbirary' number into the various bases.

If you start with the representation of the number X in these bases, it

is very simple to compute X+N in all the bases for a fixed number N. By

starting with N consecutive numbers precomputed in the bases (like the

numbers 1 to N), you would then step through all numbers above that

until some base overflows its storage.

No need for big multipliers or dividers, just simple constant

incrementers. For example, for the base 17 digits, represented with 5

bits, you just need the current 5 bit, the 1 bit carry in, and 5

CONSTANT increment value, so it is simple lookup for each bit. Maybe to

do a bit of work to optimize the carry chain for speed.

May 9, 2021, 5:21:35 PMMay 9

to

I agree with you Richard. I did not thought of that!

Its definitely a very efficient way to process like 200M 72-bit numbers per second (assuming 200MHz in a cheap modern FPGA device with some pipelining) with one small FPGA engine/coprocessor.

Its definitely a very efficient way to process like 200M 72-bit numbers per second (assuming 200MHz in a cheap modern FPGA device with some pipelining) with one small FPGA engine/coprocessor.

May 9, 2021, 7:04:17 PMMay 9

to

On 5/9/21 5:21 PM, Nelson Ribeiro wrote:

> I agree with you Richard. I did not thought of that!

> Its definitely a very efficient way to process like 200M 72-bit numbers per second (assuming 200MHz in a cheap modern FPGA device with some pipelining) with one small FPGA engine/coprocessor.

>

My guess is that that would be the processing rate for a single core
> I agree with you Richard. I did not thought of that!

> Its definitely a very efficient way to process like 200M 72-bit numbers per second (assuming 200MHz in a cheap modern FPGA device with some pipelining) with one small FPGA engine/coprocessor.

>

unit, which will take about 2k Luts.

Reasonable cheap FPGAs will likely handle a small multiple of that.

Maybe getting 10s of copies in middle sized but still reasonably priced.

This does assume that you will be just incrementing through the values.

IF you are able to skip large jumps, that might help you with a

different algorithm, and perhaps that would be worth it. If it is just

occational jumps of large values, perhaps giving up some number of

processors to have a unit that can compute the next possible value and

factor into the needed bases, and then restart there.

I suppose the big question is how big of gaps do you tend to find, If it

can jumps thousands of values, it could well be worth it, and I suppose

it well could be. I could see the binary represtation could establish an

upper bound for the sum of digits, and if higher order digits of some

base exceed that value, you know you need to increment till those

change, which could be a very big jump.

May 10, 2021, 3:09:48 AMMay 10

to

Richard Damon <Ric...@Damon-Family.org> schrieb:

If I limit myself to 72 bits, I have around 4.72237E+21 possible

binary numbers, but "only" 2.91386E+18 eligible numbers base 17 which

have a sum of digits of 72 or less, so this is a reduction by

a factor of 1600 alone, more if you look at the actual ranges

rather than the maximum as I did above. For base 13, the factor

is around 120, for 11 it is 50.

> I suppose the big question is how big of gaps do you tend to find, If it

> can jumps thousands of values, it could well be worth it, and I suppose

> it well could be. I could see the binary represtation could establish an

> upper bound for the sum of digits, and if higher order digits of some

> base exceed that value, you know you need to increment till those

> change, which could be a very big jump.

You are right, the gaps are indeed huge, and the gains enormous.
> can jumps thousands of values, it could well be worth it, and I suppose

> it well could be. I could see the binary represtation could establish an

> upper bound for the sum of digits, and if higher order digits of some

> base exceed that value, you know you need to increment till those

> change, which could be a very big jump.

If I limit myself to 72 bits, I have around 4.72237E+21 possible

binary numbers, but "only" 2.91386E+18 eligible numbers base 17 which

have a sum of digits of 72 or less, so this is a reduction by

a factor of 1600 alone, more if you look at the actual ranges

rather than the maximum as I did above. For base 13, the factor

is around 120, for 11 it is 50.

May 10, 2021, 8:15:00 AMMay 10

to

only using the highest base will probably get you enough to be

practical, and will allow still good speed.

Build the system with 1 (small FPGA), 17 (medium FPGA), or 17^2 (large

FPGA) of these computation cores.

The incrementer rather than being a fixed increment gets the increment

to do from logic looking at the sum of the upper digits of the base 17

number of the first unit, and will add a power of 17 to the current

numbers in all the bases. You will just precompute the powers of 17 in

the bases you are using.

If you start at 0, then the first unit will only roll its upper digits

when all the digits below that digit are zero, so we can rapidly skip by

just adding repeatedly that power of 17 to the sum.

Yes, we could compute a multiple of that power to add to make that digit

roll to 0, but my first guess is that this very likely will cost us more

than the at most 16 cycles to wrap it (needing a number of base-k

multiplies), so better just punt and just add 17^n repeatedly to do it.

If you have only 1 unit, then you could get more complicated skip logic

and let the other bases inject their skips, but you need to be careful

about not lettin yourself add too much and go past the roll over point

as after a skip you might not be at the right nmultiple of the power of

thqt base. The question becomes if it is worth the complexity.

Jun 4, 2021, 3:39:17 AMJun 4

to

Richard Damon <Ric...@Damon-Family.org> schrieb:

I have since implemented two algorithms which gave me an

enormous speedup on traditional CPUs.

Key to both algorithms is a function which returns the range of

the sums of digits base n between integers a and b. For base 2,

this is particularly efficient.

One method is a recursive binary search - given a range between a

and b, it checks if, for all bases looked at, the ranges of sums

of digits intersect. If they don't, return. If they do, partition

into two parts and look again for each one.

The second is the skip function you mentioned above. If it

is given a base-17 number like, it looks for the next largest

number with one more zero digit at the end, like this:

01 03 16 04 03

01 03 16 05 00

01 04 00 00 00

01 04 00 00 00

02 00 00 00 00

(have to watch for carries there) and tests at each stage if the

sum of digits base 2 in the range between the original number

and the new one is still valid.

This is _extremely_ efficient - at a high number range, this can

give skips of 17^10 or so. I alternate this base 17, base 13 and

base 11.

This has allowed me to find numbers which have the same

number of digits in base 17, 13, 11, 5 and 5 (not 3), like

7172806004621143883825103 (which is larger than 2^82). There are

very few of those, and none have so far had the same sum of digits

in base 3.

A key to speed is obviously the time in which a large number

in binary format can be converted into base n. Is an FPGA

the right tool for that?

> On 5/10/21 3:09 AM, Thomas Koenig wrote:

>> Richard Damon <Ric...@Damon-Family.org> schrieb:

>>> I suppose the big question is how big of gaps do you tend to find, If it

>>> can jumps thousands of values, it could well be worth it, and I suppose

>>> it well could be. I could see the binary represtation could establish an

>>> upper bound for the sum of digits, and if higher order digits of some

>>> base exceed that value, you know you need to increment till those

>>> change, which could be a very big jump.

>>

>> You are right, the gaps are indeed huge, and the gains enormous.

>>

>> If I limit myself to 72 bits, I have around 4.72237E+21 possible

>> binary numbers, but "only" 2.91386E+18 eligible numbers base 17 which

>> have a sum of digits of 72 or less, so this is a reduction by

>> a factor of 1600 alone, more if you look at the actual ranges

>> rather than the maximum as I did above. For base 13, the factor

>> is around 120, for 11 it is 50.

>>

>

> It sounds like this skip will be key to processing, and I suspect that

> only using the highest base will probably get you enough to be

> practical, and will allow still good speed.

Maybe a bit of an update here.
>> Richard Damon <Ric...@Damon-Family.org> schrieb:

>>> I suppose the big question is how big of gaps do you tend to find, If it

>>> can jumps thousands of values, it could well be worth it, and I suppose

>>> it well could be. I could see the binary represtation could establish an

>>> upper bound for the sum of digits, and if higher order digits of some

>>> base exceed that value, you know you need to increment till those

>>> change, which could be a very big jump.

>>

>> You are right, the gaps are indeed huge, and the gains enormous.

>>

>> If I limit myself to 72 bits, I have around 4.72237E+21 possible

>> binary numbers, but "only" 2.91386E+18 eligible numbers base 17 which

>> have a sum of digits of 72 or less, so this is a reduction by

>> a factor of 1600 alone, more if you look at the actual ranges

>> rather than the maximum as I did above. For base 13, the factor

>> is around 120, for 11 it is 50.

>>

>

> It sounds like this skip will be key to processing, and I suspect that

> only using the highest base will probably get you enough to be

> practical, and will allow still good speed.

I have since implemented two algorithms which gave me an

enormous speedup on traditional CPUs.

Key to both algorithms is a function which returns the range of

the sums of digits base n between integers a and b. For base 2,

this is particularly efficient.

One method is a recursive binary search - given a range between a

and b, it checks if, for all bases looked at, the ranges of sums

of digits intersect. If they don't, return. If they do, partition

into two parts and look again for each one.

The second is the skip function you mentioned above. If it

is given a base-17 number like, it looks for the next largest

number with one more zero digit at the end, like this:

01 03 16 04 03

01 03 16 05 00

01 04 00 00 00

01 04 00 00 00

02 00 00 00 00

(have to watch for carries there) and tests at each stage if the

sum of digits base 2 in the range between the original number

and the new one is still valid.

This is _extremely_ efficient - at a high number range, this can

give skips of 17^10 or so. I alternate this base 17, base 13 and

base 11.

This has allowed me to find numbers which have the same

number of digits in base 17, 13, 11, 5 and 5 (not 3), like

7172806004621143883825103 (which is larger than 2^82). There are

very few of those, and none have so far had the same sum of digits

in base 3.

A key to speed is obviously the time in which a large number

in binary format can be converted into base n. Is an FPGA

the right tool for that?

Jun 8, 2021, 11:44:10 AMJun 8

to

Well, I personally don't know any algorithm to convert a "large number in binary format into any base n, with n being a prime number" that would be a good fit for FPGAs.

Naïve algorithms make use of division operations, and FPGA are not good at these operations.

But that skip method seems to be very promising... but it may need a lot of investigation/exploration/analysis/research from my point of view....But I really can see that the gain in skipping values is really major, I simply cannot think right now in a good "architecture" to implemented it!

What I can show you is where FPGAs shine. I wrote a module in Verilog code that can be synthesizable at 100 MHz (barely!) for a Zynq 7020 when retiming is used (basically I did not pipelined the design, but used some of the tools options that tries to do it for me) and that makes use of around 2100 LUTS.

The concept idea for the system would be the following:

There would be an application running in an PC (written in C, C#, Python, whatever language it would be preferred) that would create jobs to be distributed to boards with FPGA devices (either through Ethernet, or simply through UART). In a FPGA device it would exist at least one (Soft) processor connected to many of these modules, to which those jobs are distributed. These jobs would consist in 2 72-bit numbers, one at which the processing would start, another at which the processing would end. (The module requires that the Start Number would be converted to each n Base by the (Soft) Processor before it starts processing).

The description of the module is the following:

For each base (2, 3, 5, 7, 11, 13, 17) there is a counter of that base, which at every and each 1 clock cycle advances one unit. In pipeline and in parallel with these counters there is a tree of adders ( well for base 2 the popcount module is used) to sum up all the "digits" values of that number for each base. To avoid adders with more than 7 bits, overflow flag is used whenever a sum does not fit in 7 bits. At one point every adder of each base n is compared with each other. If none overflowed, and if all have the same value then this is a relevant value, and outputs this signal.

The module that I designed is not finished, is a proof of concept, it may have bugs, but has been designed to show how to generate sequences A135127, A212222, A335839 and the next sequence of “Integers whose sum of digits in base b is the same for every prime b up to 17.”

It can be found in: https://www.edaplayground.com/x/AUaM

For the people that will look into the code, sorry for the lack of comments… just wanted to try out some things. For instance, a counter with 72 bits running at 100MHz is not something that I have done in the past (for low end FPGAs), or tree adders with like 20 7-bit adders running at the same 100MHz… These would ideally be pipelined by hand, but I got away with the tools options, after inserting a few pipeline registers. The 100 MHz objective would be easier to achieve if less bits would be used…

One module like this would process 100M numbers per second, and for the A335839 sequence would process up to 4294899857375(base 10) in half a day (Am I doing the math correctly? 4294899857375 / 100000000 /3600/ 24 = 0.49709!)

Don’t know why your target is 72-bits and above, but with less bits (lets say 64-bits numbers) 100MHz would be better achievable, and in a 150€ FPGA board, it would fit 12/15 of these modules.

Regards,

Nelson

Naïve algorithms make use of division operations, and FPGA are not good at these operations.

But that skip method seems to be very promising... but it may need a lot of investigation/exploration/analysis/research from my point of view....But I really can see that the gain in skipping values is really major, I simply cannot think right now in a good "architecture" to implemented it!

What I can show you is where FPGAs shine. I wrote a module in Verilog code that can be synthesizable at 100 MHz (barely!) for a Zynq 7020 when retiming is used (basically I did not pipelined the design, but used some of the tools options that tries to do it for me) and that makes use of around 2100 LUTS.

The concept idea for the system would be the following:

There would be an application running in an PC (written in C, C#, Python, whatever language it would be preferred) that would create jobs to be distributed to boards with FPGA devices (either through Ethernet, or simply through UART). In a FPGA device it would exist at least one (Soft) processor connected to many of these modules, to which those jobs are distributed. These jobs would consist in 2 72-bit numbers, one at which the processing would start, another at which the processing would end. (The module requires that the Start Number would be converted to each n Base by the (Soft) Processor before it starts processing).

The description of the module is the following:

For each base (2, 3, 5, 7, 11, 13, 17) there is a counter of that base, which at every and each 1 clock cycle advances one unit. In pipeline and in parallel with these counters there is a tree of adders ( well for base 2 the popcount module is used) to sum up all the "digits" values of that number for each base. To avoid adders with more than 7 bits, overflow flag is used whenever a sum does not fit in 7 bits. At one point every adder of each base n is compared with each other. If none overflowed, and if all have the same value then this is a relevant value, and outputs this signal.

The module that I designed is not finished, is a proof of concept, it may have bugs, but has been designed to show how to generate sequences A135127, A212222, A335839 and the next sequence of “Integers whose sum of digits in base b is the same for every prime b up to 17.”

It can be found in: https://www.edaplayground.com/x/AUaM

For the people that will look into the code, sorry for the lack of comments… just wanted to try out some things. For instance, a counter with 72 bits running at 100MHz is not something that I have done in the past (for low end FPGAs), or tree adders with like 20 7-bit adders running at the same 100MHz… These would ideally be pipelined by hand, but I got away with the tools options, after inserting a few pipeline registers. The 100 MHz objective would be easier to achieve if less bits would be used…

One module like this would process 100M numbers per second, and for the A335839 sequence would process up to 4294899857375(base 10) in half a day (Am I doing the math correctly? 4294899857375 / 100000000 /3600/ 24 = 0.49709!)

Don’t know why your target is 72-bits and above, but with less bits (lets say 64-bits numbers) 100MHz would be better achievable, and in a 150€ FPGA board, it would fit 12/15 of these modules.

Regards,

Nelson

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu