GSOC Introduction

Shubham thorat

unread,

Mar 8, 2020, 8:37:11 AM3/8/20

to sympy

hello everyone,

I am Shubham Thorat, a third-year Electronics undergraduate at VIT, Pune India. I have been using python for the last 4 years, I participate in coding competitions on a regular basis.

Domain Of Interest

Number Theory
Linear Algebra
Applied Physics
Discrete Mathematics
Calculus
Probability and Statistics
Machine Learning
Signals and systems
Data Compression
Control Systems
Computer Architecture
Operating Systems
Advance Data Structure
OOPS
Design and Analysis Of Algorithm
Digital /Analog Electronics
Microprocessors and Microcontrollers

Projects and other roles:

Ekasutram (Mathematics Club) (Pune, Maharashtra) 08 2018 - Present
- Title – Club Head
  - The club conducts activities like mathematics seminar on number theory.
  - Application of mathematics in coding through coding competition.
  - The setting of problems for a coding competition and the mathematical question of the week.
  - Organizing a session like Machine Learning without using libraries and activities related to game theory.

HackerEarth (Pune, Maharashtra) 07 2019 – 09 2019 (Django, Machine Learning)
- Title – Missing Hackathon
  - An application that utilizes available digital footprint in an ML model.
  - Goal is to suggest the location of a missing person.
  - An architecture is created for different designation of police officials.
  - Everyone with their own portal to perform activities like assigning cases, uploading documents etc.
Kneo Automation Pvt Ltd (Pune, Maharashtra) 05 2019 - 07 2019 (Django, ML in python)
- Title – Project Intern
  - The project involved the building of Machine Learning models.
  - Input parameters included bd_tracedate, bd_rise, bd_fall, bd_event_code.
  - The goal of this project was to predict the errors (event code) occurring in machines.
Hacknight1.0 sponsored by Microsoft (Bengaluru, Karnataka) 07 2019 - 08 2019
- Title – Hackathon
  - Built a UI on top of a Machine Learning model which predicts whether an employee should be given leave or not.
  - There were 27 Input parameters like timestamp, age, gender, country, self-employed, family
  - history, work interfere, number of employees, benefits, welfare program, comments etc.
  - competitive round score - 97.370000
  - ML round score – 81.853282
Title – Hospital Management System (Django, azure)
- An application that manages hospital such as booking an appointment, updating doctor's available hours, patient diagnosis and prescription.
Title – Handwritten digit recognition
- Numbers are recognized using Neural Network using octave.
Title – MyOS (C, assembly)

An OS that can be installed on any 32-bit machine.

I have started sending pull requests to sympy. Currently, I am understanding the flow of sympy in deep. I am interested in working on Floating Point expressions optimization.

shubham

unread,

Mar 8, 2020, 9:47:58 AM3/8/20

to sympy

Dear Developers,

I am Shubham pursuing my graduation in CSE with specialization in Machine Learning . I am comfortable with python , c , c++ and very active in creating projects . Mathematics is core part for a programmer and so for me . I have been going through Sympy documentation for a month . So , i have little familiarity with sympy and feel I can really contribute to it.

i am comfortable with :

Linear Algebra
Probability and Statistics
Transform Calculus
Differential equation
Numerical solution of ordinary and partial differential equations

So, can you suggest me some project ?

Regards

Shubham

Shubham thorat

unread,

Mar 8, 2020, 10:47:41 AM3/8/20

to sympy

I am referring the given links for optimizing floating-point expressions

https://herbie.uwplse.org/

https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf

To Do

1] databases of rules to find rearrangements(the database can be updated by user)
- Including those for the commutativity, associativity, distributivity
- Identity of basic arithmetic operators
- Fraction arithmetic
- Laws of squares, square roots, exponents, logarithms
- Some basic facts of trigonometry
2] sampling points
- These input points are sampled uniformly from the set of floating point bit patterns.
- Each sampled point is a combination of a random mantissa, exponent, and a sign bit.
- By sampling exponents uniformly, We can generate both very small and very large input points, as well as inputs of normal size
3] error calculation
- STOKE: error = no_of_floating_point_values_between(approximate and exact_answer)
- E(x, y) = log 2 |{z ∈ FP | min(x, y) ≤ z ≤ max(x, y)}|
- mpmath can be used for this operation.
4] localizing the error points and operators
- We use a greedy, hill-climbing search to tracks a set of candidate programs
- apply rules at various locations in these candidates, evaluates each resulting program, and repeats the process on the best candidates.
- For polynomial approximations, we can use sympy specialized series expansion pass to handle inaccuracies near 0 and ±∞
7] recursively solving the subparts
- Simplifying the equation
- Replacing the expression using the different equations from the database.
- Series expansion
- recursively solving the candidate keys.
5] finding operations with the highest error
- using srepr tree for getting subexpressions.
- For each operation, we will average the local error for all sample points,
- focuse its rewrites at the operations with the highest average
- simplifying rewrites after every step.
6] Updating program Table
- keeping only those candidates that achieve the best accuracy on at least one sample point.
- store the processed program
- after adding the candidate it might happen that previously added are no longer best so they can also be removed from the table.
- above mentioned paper suggests that in practice, running until the table reaches saturation does not give better results than running for 3 iterations.
7] returning the split Piecewise expression for different ranges
- based on the sample points

Functions (refering the above mentioned lines), I want to implement the following functions for my GSOC project

Definition-main(program) :

points := sample-inputs(program)

exacts := evaluate-exact(program, points)

table := make-candidate-table(simplify(program))

repeat N times

candidate:= pick-candidate(table)

locations := sort-by-local-error(all-locations(candidate))

locations.take(M )

rewritten := recursive-rewrite(candidate, locations)

new-candidates := simplify-each(rewritten)

table.add(new-candidates)

approximated := series-expansion(candidate, locations)

table.add(approximated)

return infer-regimes(table).as-program

Definition local-error(expr, points) :

for point ∈ points :

args := evaluate-exact(expr.children)

exact-ans := F(expr.operation.apply-exact(args))

approx-ans := expr.operation.apply-approx(F(args))

accumulate E(exact-ans, approx-ans)

Definition recursive-rewrite(expr, target) :

select and where are non-deterministic choice and requirement

select input

output from RULES

where input.head = expr.operator ∧ output.head = target.head

for (subexpr, subpattern) ∈ zip(expr.children, input.children) :

if ¬matches(subexpr, subpattern) :

recursive-rewrite(subexpr, subpattern)

where matches(expr, input)

expr.rewrite(input -> output)

Definition simplify(expr) :

iters := iters-needed(expr)

egraph := make-egraph(expr)

repeat iters times :

for node ∈ egraph :

for rule ∈ SIMPLIFY - RULES :

attempt-apply(rule, node)

return extract-smallest-tree(egraph)

Definition iters-needed(expr) :

if is-leaf(expr) :

return 0

else :

sub-iters := map(iters-needed, expr.children)

iters-at-node := if is-commutative(expr.head) then 2 else 1

return max(sub-iters) + iters-at-node

Definition infer-regimes(candidates, points) :

for x i ∈ points :

best-split 0 [x i ] = [regime(best-candidate(−∞, x i ), −∞, x i )]

for n ∈ N until best-split n+1 = best-split n :

for x i ∈ points ∪ {∞} :

for x j ∈ points, x j < x i :

extra-regime := regime(best-candidate(x j , x i ), x i , x j )

option[x j ] := best-split[x j ] ++ [extra-regime]

best-split n+1 [x i ] := lowest-error(option)

if best-split n [x i ].error − 1 ≤ best-split n+1 [x i ].error :

best-split n+1 [x i ] := best-split n [x i ]

split := best-split ∗ [∞]

split.refine-by-binary-search()

return split

Please share your insights on this topic.

Shubham thorat

unread,

Mar 11, 2020, 6:48:00 PM3/11/20

to sympy

This is how I have divided the tasks:

The algorithm is defined in this paper: https://herbie.uwplse.org/pldi15-paper.pdf

Error Calculation on sampling points

Find the best-accepted level of precision so we calculate the actual correctly up to 64 bits by increasing the precision until we get constant up to 64 bits.
Calculate the value using float( hardware precision).
Compare real and float value by calculating E(x,y) = log(2,z), z = number of floating-point between real and approximate answers.
averaging these differences over the sampling to see how accurate the expression is.

Pick Candidate

Pick candidates (subexpression) from the table and apply error calculation as well as a local error at each location on the sampled points.
the database will have a set of rewrite rules like commutativity, associativity, distributivity, (x + y) = (x**2 - y**2)/(x - y), (x - y) = (x**3 - y**3)/(x**2 + y**2 + x*y), x = exp(log( x )) etc..

Recursive- rewrite

Rewrite candidates using the database of rules and simplify to cancel terms.
Recursively repeat the algorithm on the best subexpression.

Series Expansion

Finding the series expansion of expressions to remove error near 0 and infinity.

Candidate Tree

Only keep those candidates that give the best accuracy on at least one location.
On every iteration of the outer loop, the algorithm chooses the program from the table and uses it to find new candidates, every program is used once.
Candidate programs are also saved as a pair of maps that are tied with the location that they are best at.
removing candidates if more than one candidate is giving the same results based on their results at other locations.
Before the series approximation step, we will use the set cover approximation algorithm to prune candidates to have a minimal set.

Get Piecewise solutions

Split is found using dynamic programming and later refined using binary search.

These are the functions:

Definition-main(program) :

points := sample-inputs(program)

exacts := evaluate-exact(program, points)

table := make-candidate-table(simplify(program))

repeat N times

candidate:= pick-candidate(table)

locations := sort-by-local-error(all-locations(candidate))

locations.take(M )

rewritten := recursive-rewrite(candidate, locations)

new-candidates := simplify-each(rewritten)

table.add(new-candidates)

approximated := series-expansion(candidate, locations)

table.add(approximated)

return infer-regimes(table).as-program

Definition local-error(expr, points) :

for point ∈ points :

args := evaluate-exact(expr.children)

exact-ans := F(expr.operation.apply-exact(args))

approx-ans := expr.operation.apply-approx(F(args))

accumulate E(exact-ans, approx-ans)

Definition recursive-rewrite(expr, target) :

select input

output from RULES

where input.head = expr.operator ∧ output.head = target.head

for (subexpr, subpattern) ∈ zip(expr.children, input.children) :

if ¬matches(subexpr, subpattern) :

recursive-rewrite(subexpr, subpattern)

where matches(expr, input)

expr.rewrite(input -> output)

Definition infer-regimes(candidates, points) :

for x i ∈ points :

best-split 0 [x i ] = [regime(best-candidate(−∞, x i ), −∞, x i )]

for n ∈ N until best-split n+1 = best-split n :

for x i ∈ points ∪ {∞} :

for x j ∈ points, x j < x i :

extra-regime := regime(best-candidate(x j , x i ), x i , x j )

option[x j ] := best-split[x j ] ++ [extra-regime]

best-split n+1 [x i ] := lowest-error(option)

if best-split n [x i ].error − 1 ≤ best-split n+1 [x i ].error :

best-split n+1 [x i ] := best-split n [x i ]

split:= best-split ∗ [∞]

split.refine-by-binary-search()

return split

I have written a basic brute force code.

from sympy import *
import numpy as np

x = Symbol("x")
y = Symbol("y")
z = Symbol("z")


#points
maxi = 1000000000000000000000000000000
mini = -1000000000000000000000000000000
step = (maxi-mini)/256

start = mini
points = []

for i in range(0,256):
    points.append(start)
    start += step

#calculate error
def calc_error(expr,point):
    from mpmath import mp, mpf
    mp.dps = 1000
    symb = expr.atoms(Symbol)
    unq_sym = len(symb)


    subst_sym = []
    subst_mpf = []

    i=0
    for sym in symb:
        subst_sym.append((x,point))
        #subst_sym.append((x,300))
        #subst_sym.append((z,400))                
        subst_mpf.append( ( sym,mpf(point) ) )
        i = i+1

    ans1 = expr.subs(subst_sym)
    ans2 = expr.subs(subst_mpf)
    return ans1,ans2

#replacement functions
#database 
rule = []
rule.append(lambda exp: (exp.args[0]**3 + exp.args[1]**3)/(exp.args[0]**2 + exp.args[1]**2 - exp.args[1]*exp.args[0]) if type(exp) ==Add and len(exp.args)==2 and (type(exp.args[0]) != Symbol and type(exp.args[1]) != Symbol) else False)
rule.append(lambda exp: (exp.args[0]**2 - exp.args[1]**2)/(exp.args[0]-exp.args[1]) if type(exp) ==Add and len(exp.args)==2 and (type(exp.args[0]) != Symbol and type(exp.args[1]) != Symbol)else False)
#rule.append(lambda expr: exp(log(expr)))

#rule1 = lambdify([x, y, (x**2 - y**2)/(x+y))

expr  = (x+1)**0.5 - x**0.5
#expr = x+1-x #+ (1/(x-1)) - (2/x)
#expr = (-y + (-4*x*z + y**2)**0.5)/(2*x)
#expr = (x+1)**(1/3) - x**(1/3)
#expr  = (x+1)**0.25 - x**0.25 
expr = simplify(expr)
main_expr = expr
temp_expr = expr
pprint(rule[0](expr))
print(rule[1](expr))

all_func = []
mapper = dict()

def pre(expr):
    for fi in rule:
        print("see ",expr)
        for pnt in points:
            for fi in rule:
                k = fi(expr)
                if k!=False:                
                    ans1, ans2 = calc_error(expr, pnt)
                    update_func = main_expr.subs(expr, k)
                    ans3, ans4 = calc_error(update_func, pnt)
                    diff1 = ans2 - ans1
                    diff2 = ans4 - ans3
                    #print("istrue ", ans4==ans2)
                    #print("check1 ", ans4)
                    #print("check2 ", ans2)
                    #print("difference ", diff1, " ", diff2)
                    try:
                        if abs(diff2) <= abs(diff1):   #(diff2) <= (diff1)                 
                            all_func.append(update_func)
                            print("inn ", pnt)
                            try:
                                lister = mapper[update_func]
                                
                                lister.append(pnt)
                                
                                mapper[update_func] = lister
                            except:
                                mapper[update_func] = [pnt]  
                    except:
                        print("failed ",pnt)
                        #print(abs(diff1), " ", abs(diff2))
                        #gprint(isinstance(diff2, complex))
                        #print(update_func)
                        #exit()        
            #all_func.append(simplify(main_expr.subs(expr,k)))
    print(expr, " space ",)
    for arg in expr.args:
        pre(arg)

if len(all_func)==0:
    all_func.append(simplify(main_expr))

pre(expr)
#print(srepr(expr))
#pprint(set(all_func))
#print("mapper")
#print(mapper)
values = list(set(all_func))
i=0
for expr in values:
    values[i] = expr.replace( lambda expr: type(expr)== Pow and expr.args[1]==1, lambda expr: Mul(expr.args[0],1))
    i = i+1
pprint(values)Enter code here...

Output:

1.5 1.5

- x + (x + 1) 1 0.5 0.5

────────────── , ────────── , - x + (x + 1)

0.5 0.5 0.5 0.5

x ⋅ (x + 1) + 2⋅x + 1 x + (x + 1)

please tell me your thoughts on this, any addition, deletion of algorithmic improvement.

Oscar Benjamin

unread,

Mar 11, 2020, 7:33:32 PM3/11/20

to sympy

It's not clear to me what problem your proposal is intending to solve.
In what way is it different from the current evalf algorithm/results?

--
Oscar

> --
> You received this message because you are subscribed to the Google Groups "sympy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sympy+un...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sympy/f94a7d30-dc4c-4ba0-a333-e67ff1b63513%40googlegroups.com.

Shubham thorat

unread,

Mar 12, 2020, 11:47:01 AM3/12/20

to sympy

The current evalf() is used to evaluate a numerical expression into a floating-point number using an arbitrary precision library mpmath.

What I want to do is to get the best answer for different ranges that varies from (-inf, inf) without increasing precision.

for example:

expr1 = (x + 1)**0.5 - x**0.5

when we calculate it for x = 10000000000000000.98698698, we get 0

- expr1.subs(x,10000000000000000.98698698) = 0

but if we rewrite the expr1 as 1/(x**0.5 + (x + 1)**0.5), we get a better answer without catastrophic cancellation as in the previous case.

- expr2 = 1/(x**0.5 + (x + 1)**0.5)

- expr2.subs(x,10000000000000000.98698698) = 5.00000000000000e-9

I am proposing to implement this paper: https://herbie.uwplse.org/pldi15-paper.pdf

- Where an expression can be rewritten to get the best possible answer without increasing the precision.

- The rewriting database will have different function and properties like commutativity, associativity, distributivity, (x + y) = (x**2 - y**2)/(x - y), (x - y) = (x**3 - y**3)/(x**2 + y**2 + x*y), x = exp(log( x )) etc.

- I want to create a class where a symbolic expression along with its optional range is given as an input, which will be rewritten to get the best possible expression.

Please correct me if I have made mistake in understanding the things, also suggest the scope and changes for this.

Chris Smith

unread,

Mar 13, 2020, 4:44:40 PM3/13/20

to sympy

The paper about Herbie is very interesting. Since SymPy is primarily symbolic it would work well at producing expressions via Herbie that would perform well -- an optimized expression for evaluation. It reminds me a little of the Fu-heuristics that are used for trigonometric simplifications.

/c

Shubham thorat

unread,

Mar 13, 2020, 6:01:34 PM3/13/20

to sympy

Herbie also uses heuristic search estimates and localize error using sampled points.
Same as in Fu et al, Herbie applies a database of rules and follows an algorithm that recursively rewrites candidate programs.

I have attached a pdf that includes the rules that have to there in the database.

rewrite_rules.pdf

Shubham thorat

unread,

Mar 17, 2020, 5:15:27 AM3/17/20

to sympy

I have created the first draft of my proposal for "Optimising floating-point expressions",

https://docs.google.com/document/d/12eBRsCbZkBfh4pj2MAaurdMWx6nd1MTYxXTmlRKriJ8/edit?usp=sharing