Solving this problem in Lisp: https://www.ruby-forum.com/topic/6877379#new

TwirlwT

unread,

Dec 16, 2015, 3:30:49 PM12/16/15

to

Hello, The problem posted in Ruby forum seems interesting
https://www.ruby-forum.com/topic/6877379#new

I have a working solution in Ruby that I will not disclosed. I am sure Lisper can do better, my first bound is about 10000 bases. Can you do it better?

Marco Antoniotti

unread,

Dec 16, 2015, 4:33:30 PM12/16/15

to

(compile-file "the-solution.lisp") will certainly yield faster code. Provided the algorithm used were comparable.

Cheers
--
MA

TwirlwT

unread,

Dec 17, 2015, 12:12:11 PM12/17/15

to

----------------

A solution in Lisp is given below. The argument k to f=longest-common-substr indicates that the substrings should be separated by at least k bases, to avoid overlapping. The optimum feasible solution must be no overlapping and so f(k)>=k. Since f(k) is a decreasing function of k, the optimum is the largest k such that f(k)>=k. The main idea is to employ an array of indexes for the bases. The main idea is to sort the array of indexes using

i < j <=> (string< bases bases :start1 i :start2 j). The rest follows easily.

Curiously, the result obtained indicates that in the X chromosome of the Drosophila exists big portions that are shifted copies, providing some insight about the evolution of this species (I suppose so).

Finally, in ruby I try to sort the array of indexes using a similar idea but it was very slow. My attempt was to sort_by {|i,j| bases[i .. -1] <=> bases[j .. -1] }, but the sorting took a very long time compared to sbcl, perhaps there better way of sorting that I don't know in ruby.

* (load "mosquito.lisp")
longest-common-substr for k = 11505 is (11508 53395)
longest-common-substr for k = 11506 is 2639
Hence, the greatest k such that f(k) >= k is 11505
The substrings are located at positions 27441, 15935 and 39112

-----------------------------------------------------------------
Source code follows:

(defparameter bases (with-open-file (f "mosca.txt"
:direction :input
:element-type 'base-char)
(read-line f)))

(defun ordena()
(let ((anum (coerce (loop for i below (length bases) collect i) 'vector)))
(sort anum (lambda(i j) (string< bases bases :start1 i :start2 j)))))

(defparameter anum (ordena))

(defun mis(i j)
(let ((i1 (aref anum i))
(j1 (aref anum j)))
(- (mismatch bases bases :start1 i1 :start2 j1) i1)))

(defun longest-common-substr(k)
"Compute the longest common substring,
The parameter k is used to guarantee that each pair of substring are
at a distance greater than k. The condition f(k)>=k guarantees that
there is not overlapping so that the solution obtained is feasible.

Finally, since f(k) is a decreasing function of k, the
longest feasible solution is the greatest k such that f(k) >= k. Such
k can be found with a loop such as: while f(k)>=k increase k, but is
better to look for k using binary search.

Note that the ordering used for the indices implies that
mismatch(i, i+p) = k0 then mismatch(i,j)>= k0 for each j with i<=j<=k0.
"

(loop with maxi = 0 with pos = 0 with aux = 0
for i upfrom 0 below (- (length anum) 3)
for a = (aref anum i)
for b = (aref anum (1+ i))
for c = (aref anum (+ 2 i))
when (and (> (abs (- a b)) k)
(> (abs (- a c)) k)
(> (abs (- b c)) k))
do
(progn
(setq aux (- (mismatch bases bases :start1 a :start2 c) a))
(when (> aux maxi)
(setq maxi aux
pos i)))
finally (return (values maxi pos))))

(multiple-value-bind (maxlen location)
(longest-common-substr 11505)
(format t "longest-common-substr for k = 11505 is ~S~%" (list maxlen location))
(format t "longest-common-substr for k = 11506 is ~S~%"
(longest-common-substr 11506))
(format t "Hence, the greatest k such that f(k) >= k is 11505~%")
(format t "The substrings are located at positions ~D, ~D and ~D~%"
(aref anum location) (aref anum (1+ location)) (aref anum (+ 2 location))))

---------------------------------

Dedicated to the drosophila for her huge contribution for the advance of the biological sciences.

WJ

unread,

Dec 18, 2015, 1:13:58 AM12/18/15

to

MatzLisp (Ruby):

text = IO.binread("Drosophila_melanogaster_chromX.txt")

t = Time.now
@suffix_array = (0 ... text.size).sort_by{|i| text[i .. -1]}
printf "Sorting took %.1f seconds.\n", Time.now - t

def suffix_size(i)
@suffix_array.size - @suffix_array[i]
end

def overlap?(locations, pos, min_size)
locations.map{|p| (p - pos).abs}.min < min_size
end

reps = 3
min_size = 11506

(text.size - reps + 1).times{|i|
next if suffix_size(i) < min_size
pos = @suffix_array[i]
locations = [pos]
substr = text[ pos, min_size ]
(i+1).upto(text.size - 1){|j|
next if suffix_size(j) < min_size
pos = @suffix_array[j]
next if overlap?(locations, pos, min_size)
break if substr != text[pos,min_size]
locations << pos }
if locations.size >= reps
p locations.sort
end }

I used the complete 3 megabyte file, not the 500_000 byte file that
you mentioned in the Ruby forum.

Output:

Sorting took 17.3 seconds.
[15935, 27441, 39112]
[15936, 27442, 39113]
[15937, 27443, 39114]

TwirlwT

unread,

Dec 18, 2015, 1:00:58 PM12/18/15

to

Hello, thanks for the ruby version, (sorry to the rest of the list if this is off-topic here)

I didn't considered sort_by because, at first sight, creating the table with 3 Millions x 3 Millions bytes appears to be a terrible idea. It happens that Ruby use a copy on write scheme, so that it only create structures with displaced pointers. I have tried to use a similar scheme in Lisp using displaced-arrays but the performance is very bad in this case.

Just for comparison, in one of my old computers, the ruby code and the Lisp code both took approximately the same amount of time, about 19 seconds, when used displaced arrays it takes 85 seconds.

So here I have to acknowledge that the performance of ruby is very good at sorting. Kudos to the ruby developers.

Just nick-picking, in your code you use: locations.map{|p| (p - pos).abs}.min it could be abbreviated to locations.min{|p| (p-pos).abs}, since you don't need to construct the list just to compute the minimum.

As I like to learn many programming languages, I like to read some of your post, the only caveat is that this isn't the right place to do so but it keeps posts flowing, what is good.

-------------------------------------------

ruby wj.rb

Sorting took 18.9 seconds.

[15935, 27441, 39112]
[15936, 27442, 39113]
[15937, 27443, 39114]

(defun ordena-old()
(let ((anum (make-array (length bases)
:element-type '(unsigned-byte 30)
:initial-contents (loop for i below (length bases) collect i))))
(declare (type (simple-array base-char) bases)
(sort anum (lambda(i j) (string< bases bases :start1 i :start2 j))))))

(defun ordena()
(let* ((len1 (length bases))
(anum (make-array len1
:element-type '(unsigned-byte 30)
:initial-contents (loop for i below len1 collect i))))
(declare (type (simple-array base-char) bases))
(sort anum
#'string<
:key (lambda(i)
(make-array (- len1 i) :element-type 'base-char
:displaced-to bases :displaced-index-offset i)))))

(defun ordena-old()
(let ((anum (make-array (length bases)
:element-type '(unsigned-byte 30)
:initial-contents (loop for i below (length bases) collect i))))
(declare (type (simple-array base-char) bases)
(sort anum (lambda(i j) (string< bases bases :start1 i :start2 j))))))

CL-USER> (time (progn (ordena-old) nil))
Evaluation took:
0.193 seconds of real time
0.192000 seconds of total run time (0.180000 user, 0.012000 system)
99.48% CPU
441,885,614 processor cycles
37,741,296 bytes consed

NIL
STYLE-WARNING: redefining COMMON-LISP-USER::ORDENA in DEFUN
CL-USER> (time (progn (ordena) nil))
Evaluation took:
84.704 seconds of real time
84.692000 seconds of total run time (83.864000 user, 0.828000 system)
[ Run times consist of 2.208 seconds GC time, and 82.484 seconds non-GC time. ]
99.99% CPU
194,353,507,546 processor cycles

tar...@google.com

unread,

Dec 18, 2015, 1:30:21 PM12/18/15

to

On Friday, December 18, 2015 at 10:00:58 AM UTC-8, TwirlwT wrote:
> I didn't considered sort_by because, at first sight, creating the table with
> 3 Millions x 3 Millions bytes appears to be a terrible idea. It happens that
> Ruby use a copy on write scheme, so that it only create structures with
> displaced pointers. I have tried to use a similar scheme in Lisp using
> displaced-arrays but the performance is very bad in this case.

Well, the problem with the displaced array approach is that you are creating
millions of new arrays to get the displaced arrays. That is a lot of overhead.
And there is no requirement that CL's SORT only call the KEY function once per
element. It can call it each time it accesses the element. But in any case it
does create an awful lot of arrays that are then just thrown away as garbage.
Even though the underlying data isn't copied, you still have to create the
array object.

When you instead use the :START1 and :START2 arguments to STRING<, there is
no object creation and the comparison just uses pointers into the underlying
string.

> Just for comparison, in one of my old computers, the ruby code and the Lisp
> code both took approximately the same amount of time, about 19 seconds, when
>used displaced arrays it takes 85 seconds.

Actually, from what you post below, the original lisp code took 0.19 seconds.

It would have been interesting to see the difference in consing as well.

Carlos

unread,

Dec 18, 2015, 4:25:52 PM12/18/15

to

[TwirlwT <danieltor...@gmail.com>, 2015-12-18 10:00]
[...]

> > @suffix_array = (0 ... text.size).sort_by{|i| text[i .. -1]}

[...]

> I didn't considered sort_by because, at first sight, creating the
> table with 3 Millions x 3 Millions bytes appears to be a terrible
> idea. It happens that Ruby use a copy on write scheme, so that it
> only create structures with displaced pointers. I have tried to use
> a similar scheme in Lisp using displaced-arrays but the performance
> is very bad in this case.
>
>
> Just for comparison, in one of my old computers, the ruby code and
> the Lisp code both took approximately the same amount of time, about
> 19 seconds, when used displaced arrays it takes 85 seconds.

[...]

> (defun ordena()
> (let* ((len1 (length bases))
> (anum (make-array len1
> :element-type '(unsigned-byte 30)
> :initial-contents (loop for i below len1
> collect i)))) (declare (type (simple-array base-char)
> bases)) (sort anum
> #'string<
> :key (lambda(i)
> (make-array (- len1 i) :element-type 'base-char
> :displaced-to
> bases :displaced-index-offset i)))))

But here you are creating a displaced array every time #'sort needs to
know the value of an element. That's not what Ruby's #'sort_by does. It
creates an array with all the keys and then sorts it. The Lisp
equivalent should be something like this:

(defun ordena (bases)
(let* ((len1 (length bases))
(sort-keys (make-array len1

:initial-contents
(loop for i below len1

collect (make-array (- len1 i)
:element-type 'character
:displaced-to bases
:displaced-index-offset i))))
(sorted-keys (sort sort-keys #'string<)))
(coerce (loop for key across sorted-keys
collect (1- (length key)))
'vector)))

I can't test the sorting because my SBCL dies with a heap exhausted
error (I wonder how much an array header takes...), but it should be
quicker. Not as quick as your first method of sorting directly on the
string using :start1 and :start2.
--

TwirlwT

unread,

Dec 18, 2015, 6:36:40 PM12/18/15

to

El viernes, 18 de diciembre de 2015, 22:25:52 (UTC+1), Carlos escribió:
> [TwirlwT 2015-12-18 10:00]

Hello Carlos, I think that your implementation should be the following, it takes 60 seconds (my machine is a x86 32bits dual core, 2.3 gigahertz.

(defun ordena (bases)
(let* ((len1 (length bases))

(anum (make-array len1 :element-type 'fixnum

:initial-contents (loop for i below len1 collect i)))

(sort-keys (make-array len1
:initial-contents
(loop for i below len1
collect (make-array (- len1 i)

:element-type 'base-char
:displaced-to bases
:displaced-index-offset i)))))

(sort anum #'string< :key (lambda(i) (aref sort-keys i)))))

CL-USER> (time (progn (print "Sorting ... ") (setq anum (ordena bases)) nil))

"Sorting ... "
Evaluation took:
60.552 seconds of real time
60.504000 seconds of total run time (60.292000 user, 0.212000 system)
[ Run times consist of 0.992 seconds GC time, and 59.512 seconds non-GC time. ]
99.92% CPU
138,935,477,241 processor cycles
201,702,200 bytes consed

By the way, as indicated by Tar, the fast way of sorting takes 0.19 seconds a lot better than ruby, here CL shine.

Marco Antoniotti

unread,

Dec 19, 2015, 6:41:41 AM12/19/15

to

What implementation are you using?

MA