----------------
A solution in Lisp is given below. The argument k to f=longest-common-substr indicates that the substrings should be separated by at least k bases, to avoid overlapping. The optimum feasible solution must be no overlapping and so f(k)>=k. Since f(k) is a decreasing function of k, the optimum is the largest k such that f(k)>=k. The main idea is to employ an array of indexes for the bases. The main idea is to sort the array of indexes using
i < j <=> (string< bases bases :start1 i :start2 j). The rest follows easily.
Curiously, the result obtained indicates that in the X chromosome of the Drosophila exists big portions that are shifted copies, providing some insight about the evolution of this species (I suppose so).
Finally, in ruby I try to sort the array of indexes using a similar idea but it was very slow. My attempt was to sort_by {|i,j| bases[i .. -1] <=> bases[j .. -1] }, but the sorting took a very long time compared to sbcl, perhaps there better way of sorting that I don't know in ruby.
* (load "mosquito.lisp")
longest-common-substr for k = 11505 is (11508 53395)
longest-common-substr for k = 11506 is 2639
Hence, the greatest k such that f(k) >= k is 11505
The substrings are located at positions 27441, 15935 and 39112
-----------------------------------------------------------------
Source code follows:
(defparameter bases (with-open-file (f "mosca.txt"
:direction :input
:element-type 'base-char)
(read-line f)))
(defun ordena()
(let ((anum (coerce (loop for i below (length bases) collect i) 'vector)))
(sort anum (lambda(i j) (string< bases bases :start1 i :start2 j)))))
(defparameter anum (ordena))
(defun mis(i j)
(let ((i1 (aref anum i))
(j1 (aref anum j)))
(- (mismatch bases bases :start1 i1 :start2 j1) i1)))
(defun longest-common-substr(k)
"Compute the longest common substring,
The parameter k is used to guarantee that each pair of substring are
at a distance greater than k. The condition f(k)>=k guarantees that
there is not overlapping so that the solution obtained is feasible.
Finally, since f(k) is a decreasing function of k, the
longest feasible solution is the greatest k such that f(k) >= k. Such
k can be found with a loop such as: while f(k)>=k increase k, but is
better to look for k using binary search.
Note that the ordering used for the indices implies that
mismatch(i, i+p) = k0 then mismatch(i,j)>= k0 for each j with i<=j<=k0.
"
(loop with maxi = 0 with pos = 0 with aux = 0
for i upfrom 0 below (- (length anum) 3)
for a = (aref anum i)
for b = (aref anum (1+ i))
for c = (aref anum (+ 2 i))
when (and (> (abs (- a b)) k)
(> (abs (- a c)) k)
(> (abs (- b c)) k))
do
(progn
(setq aux (- (mismatch bases bases :start1 a :start2 c) a))
(when (> aux maxi)
(setq maxi aux
pos i)))
finally (return (values maxi pos))))
(multiple-value-bind (maxlen location)
(longest-common-substr 11505)
(format t "longest-common-substr for k = 11505 is ~S~%" (list maxlen location))
(format t "longest-common-substr for k = 11506 is ~S~%"
(longest-common-substr 11506))
(format t "Hence, the greatest k such that f(k) >= k is 11505~%")
(format t "The substrings are located at positions ~D, ~D and ~D~%"
(aref anum location) (aref anum (1+ location)) (aref anum (+ 2 location))))
---------------------------------
Dedicated to the drosophila for her huge contribution for the advance of the biological sciences.