[GSoC Project] Search Engine Thread


indianauthority97

May 19, 2017, 3:16:11 PM
to Sigmah developement
Hi, this will be the thread for discussions on my GSoC project entitled "Full-text search of database text and files, enforcing ACLs and search contexts".

I will be posting my week-by-week progress here, as well as a tentative schedule for the coming weeks.

Week 1: (Before Friday 05/19)
○ Detailed Mockups

Please find the mockups attached in png format here ( they are also attached in a zip ):

1. Basic Search Bar
2. Search Results in a new tab
3. Additional Permissions
4. Search Settings in Admin

○ Clarify technical questions on Solr for generic technical design: Is it possible to schedule an automatic update of the Solr indexes?

Yes, it is possible to schedule delta imports. Solr has not included the DataImportScheduler in any of its releases, and there is little documentation on it, since the operating system can trigger the data import instead: a cron job on Ubuntu, or Task Scheduler on Windows. I have referred to the following links:


Nonetheless, I feel that a button in the UI, together with a permission for the Admin to update the index manually ( as shown in the 4th mockup ), should be implemented first. There is a risk that the scheduled delta imports may not always run smoothly, in which case the Admin would have to fall back on the manual update.
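
A minimal sketch of the scheduling idea, done from Java rather than from cron. It assumes the DataImportHandler is registered at /dataimport in solrconfig.xml; the core name "sigmah", the host, the port and the 15-minute interval are placeholders, not values from the project.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class DeltaImportScheduler {

    // Builds the delta-import URL for a core. Assumes the handler path is /dataimport.
    static String deltaImportUrl(String solrBaseUrl, String core) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + "/" + core + "/dataimport?command=delta-import";
    }

    // Sends one GET request to the given URL and returns the HTTP status code.
    static int trigger(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Triggers a delta import immediately and then every intervalMinutes.
    // A failed request is logged and the schedule keeps running.
    static ScheduledExecutorService start(String url, long intervalMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                System.out.println("delta-import -> HTTP " + trigger(url));
            } catch (IOException e) {
                System.err.println("delta-import failed: " + e.getMessage());
            }
        }, 0, intervalMinutes, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        // Hypothetical core name; only prints the URL that start(...) would call.
        System.out.println(deltaImportUrl("http://localhost:8983/solr", "sigmah"));
    }
}
```

Calling `DeltaImportScheduler.start(url, 15)` from a long-running process would replace the cron entry; the manual button remains useful as a fallback either way.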

------------------------------------------------------------------------------------------------------------------------------------------

I have already begun working towards the "Minimum Goal" outlined, and have implemented a rough version of the first UI Component:
Search bar -(top Sigmah header (near the offline menu)).

Screenshot:



Please review and comment on my code at my local branch here:


Currently, I am trying to make a new tab open when the "Go" button is clicked.

Goals for the coming week:

Before Friday 05/26
- Design the layout of the Sigmah Solr configuration XML files ( schema.xml, data-config.xml ) and study the database to find the essential fields that need to be indexed and stored
- Continue to work on UI

Thank you!
Sigmah_Mockups_Search.zip

Olivier Sarrat

May 22, 2017, 8:32:23 AM
to sigma...@googlegroups.com
Hi Aditya,

Thanks for this good work !
I've replied to all questions regarding the Search Engine feature on its issue page: http://www.sigmah.org/issues/view.php?id=535

Have a nice day,

Olivier.
--
You received this message because you are subscribed to the Google Groups "Sigmah developement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sigmah-dev+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

osarrat.vcf

indianauthority97

May 29, 2017, 2:32:36 PM
to Sigmah developement

Hi, apologies for the late response; I have been travelling.


Update on previous week's progress:


I have been working on the UI so far, and have not found enough time for this goal:

- Design the layout of the Sigmah Solr configuration XML files ( schema.xml, data-config.xml ) and study the database to find the essential fields that need to be indexed and stored.

I will work on it this week for sure.

I gave priority to the UI because it is not the major portion of my project, and I would like to resolve any issues related to it before the official coding period begins.


As of now, the rough search bar exists.

Clicking 'Go' without entering any text does nothing.

Clicking 'Go' after entering search text opens a new tab titled "Search Results".

The title of the page is "Search Results for <searchText>".



Screenshot:


https://lh3.googleusercontent.com/-s6tk3wjfvnk/WSvJR-sipcI/AAAAAAAAAQ4/ASKwJYdIMCEP9iVwZB0rI57G4b9ptr0OwCLcB/s1600/Screenshot%2Bfrom%2B2017-05-29%2B11-56-56.png



Code: https://github.com/sigmah-dev/sigmah/pull/97/commits/e3a44e5025ac306319775ef8fce5a1c5a4157ad3


Issues:


1. Currently a single tab is shared by all search results. To open a separate tab for every search ( as asked for in http://www.sigmah.org/issues/view.php?id=535 ), I will have to continue working on this.


2. My plan ( for now ) was to update the title/contents of the page every time a new search is conducted. However, due to the final nature of the class org.sigmah.client.ui.widget.panel.Panels, the search text of the very first search remains in place for the rest of the session. This is certainly very annoying and I am thinking of ways to resolve it.


Updated Mockups ( as requested in http://www.sigmah.org/issues/view.php?id=535 ):


Please find attached a zip containing PNG and HTML files.


Notes:
- "- can you add the search filters somewhere (even if not required in minimal version) ?" - I am not clear what is being asked here.
- I have not changed Screen 04 at all.
- Regarding Screen 3, I have left open the possibility of an index update button.


I think I cannot find out more about the scheduling issue until I have run some small pilots and completed at least part of the DataImportHandler. Hence, I feel it is better to let this rest for a while.

Goals for the coming week:

Before Friday 06/02
- Design the layout of the Sigmah Solr configuration XML files ( schema.xml, data-config.xml ) and study the database to find the essential fields that need to be indexed and stored ( not done in the previous week, will focus on this )
- Continue to work on UI ( lower priority )

Thank you!
Sigmah_mockups_v2.zip

indianauthority97

Jun 5, 2017, 11:47:14 AM
to Sigmah developement
Hi, I would just like to inform you that I'll be travelling again from 17th - 26th June ( I have to attend a workshop in another part of the country ), and may put in fewer hours, but I'll keep working :)

Aditya Adhikary

Jun 5, 2017, 12:26:08 PM
to sigma...@googlegroups.com
Update on previous week's progress:

1. Regarding the issue of opening multiple tabs ( titled with the respective search text ) for different searches:

I have managed to do this, please take a look at: 

However, a few bugs persist:
- When the search text contains several space-separated words, the tabs misbehave: every space is converted to "%20", and on reloading to "%2520", or the tab title simply becomes null. I will deal with this after I have completed more of the back-end.
- The search results panel still remains frozen at the first instance created.
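
The "%20" turning into "%2520" is the usual signature of a string being percent-encoded twice: the second pass encodes the "%" itself as "%25". The sketch below reproduces the effect with the standard library only; the class name is hypothetical, and it does not claim to show where in Sigmah or GWT the second encoding happens.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class TabTokenEncoding {

    // Percent-encodes a search string. URLEncoder emits '+' for spaces
    // (form encoding), so '+' is mapped to %20 afterwards.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Reverses exactly one level of encoding.
    static String decode(String token) {
        try {
            return URLDecoder.decode(token, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String once = encode("water project");
        String twice = encode(once);
        System.out.println(once);          // water%20project
        System.out.println(twice);         // water%2520project
        System.out.println(decode(once));  // water project
    }
}
```

If this is the cause, encoding the search text exactly once when the tab token is built, and decoding exactly once when the tab title is set, should keep multi-word titles intact.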

2. Design the layout of the Sigmah Solr configuration XML files ( schema.xml, data-config.xml, solrconfig.xml )

I have worked on this, keeping in mind that for the present I am only indexing important "project" fields from the database.
I have created a core called "Test_Sigmah" for testing the indexing. DataImport succeeds, Solr documents are being created successfully, and queries are also running fine.

(To test it locally, download and install solr-6.5.0.
Create a core called Test_Sigmah.
Go to solr-6.5.0/contrib/dataimporthandler/lib and paste the PostgreSQL JDBC jar from here: https://jdbc.postgresql.org/
Go to solr-6.5.0/server/solr/Test_Sigmah and paste into it the contents of the Test_Sigmah zip I have attached here.
Start the Solr server, go to the admin UI, select Test_Sigmah as the core and do a full import on it. ( Note: The password for my database is hamsig and user is sigmah )
Check that approximately 35 documents are generated. You can also query them.)

3. I have been trying to use SolrJ to connect from Sigmah to the standalone Solr Server. Solr runs on port 8983 and Sigmah on localhost:8080 with tomcat.


My aim was to create a service on the client side that communicates asynchronously with the server side, where a connection is made to the Solr server to query it ( via SolrJ ), and the response is returned to the client side in the form of a shared DTO. I have used GWT RPC to achieve this.

I have tested this concept on a standalone GWT project and faced no issues, i.e. the SolrDocument query response is rendered as a string and returned to the client side successfully.

However, when I work with Sigmah using almost the same code, I run into trouble. I get the following error:

org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/Test_Sigmah
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:617)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:279)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:268)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:160)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:942)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:957)
at org.sigmah.server.search.SolrSearcher.search(SolrSearcher.java:92)
at org.sigmah.server.search.SearchServiceImpl.search(SearchServiceImpl.java:14)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.google.gwt.user.server.rpc.RPC.invokeAndEncodeResponse(RPC.java:569)
at com.google.gwt.user.server.rpc.RemoteServiceServlet.processCall(RemoteServiceServlet.java:208)
at com.google.gwt.user.server.rpc.RemoteServiceServlet.processPost(RemoteServiceServlet.java:248)
at com.google.gwt.user.server.rpc.AbstractRemoteServiceServlet.doPost(AbstractRemoteServiceServlet.java:62)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:661)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
...
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:117)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:515)
... 50 more

I am not able to understand the problem here. Is it because Sigmah is using Guice Filter for all its servlets and does not allow a servlet registered in the following way in web.xml?

<!-- Solr search servlet -->
<servlet>
<servlet-name>SearchServiceImpl</servlet-name>
<servlet-class>
org.sigmah.server.search.SearchServiceImpl
</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>SearchServiceImpl</servlet-name>
<url-pattern>/sigmah/search</url-pattern>
</servlet-mapping>

If anyone can shed some light on this, I'd be grateful!
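
One observation from the trace itself: SearchServiceImpl.search appears in it, so the servlet was reached, and the root cause is a java.net.ConnectException raised while opening a TCP connection to localhost:8983. That points at the network path to Solr rather than at the servlet registration. A small probe like the following, run on the same machine as Tomcat, can confirm whether the port accepts connections at all. It is a sketch with a hypothetical class name, using only the standard library.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortProbe {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // "Connection refused" and connect timeouts both end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // 8983 is the Solr port mentioned above.
        System.out.println("Solr reachable: " + isReachable("localhost", 8983, 2000));
    }
}
```

If the probe returns false while the Solr admin UI opens in a browser, the difference is likely in how "localhost" resolves or in proxy settings for the JVM, not in web.xml.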

-------------------------------------------------------------------------

Next Week:

- Continue working on the Solr config XML files.
- Test the demo database.
- Continue working on the Solr connection with SolrJ.
- Introduce automatic indexing.
- Update the installation and upgrade instructions to include the installation of Solr.

Thank you!

On Mon, Jun 5, 2017 at 9:34 PM, indianauthority97 <adity...@iiitd.ac.in> wrote:
Hello, I seem unable to post more than a few lines to this group any more. It gives me a "There was an error posting the message to the group. Please try again later." Cannot post this week's report due to the same :(




--
Aditya Adhikary

BTech, CSE
2015007

Test_Sigmah.zip

indianauthority97

Jun 14, 2017, 11:28:07 AM
to Sigmah developement
Apologies for the late report.

Update on previous week's progress:

1) Continue working on the Solr config xmls.

I have updated these configuration files to include the indexing of Contacts and OrgUnits. I discovered that, by convention, it is not possible to include multiple types of documents ( i.e. from different tables ), but I found a workaround after referring to: https://lucidworks.com/2011/02/12/solr-powered-isfdb-part-4/

While testing, I discovered an issue regarding the creation of sub-entities on integer fields ( requiring a join to create the sub-entities ) which have null values. As a workaround I have stored the values of null-valued fields as 0.

Now we can easily query Solr for projects or contacts which have the word "sigmah" featuring in them. Also, I can easily filter out queries only pertaining to projects, or orgunits, or contacts by adding an "fq" parameter to the query, such as:

fq=doc_type:PROJECT
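
As an illustration, the sketch below assembles such a request URL against Solr's standard /select handler. The doc_type field and the PROJECT value come from the example above; the class name and the wt=json choice are assumptions for the sake of the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SolrQueryUrl {

    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds a /select URL for the given core. When docType is null the
    // fq parameter is omitted and all document types are searched.
    static String selectUrl(String coreUrl, String text, String docType) {
        StringBuilder url = new StringBuilder(coreUrl).append("/select?q=").append(enc(text));
        if (docType != null) {
            url.append("&fq=").append(enc("doc_type:" + docType));
        }
        return url.append("&wt=json").toString();
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr/Test_Sigmah", "sigmah", "PROJECT"));
    }
}
```

Because fq is applied as a filter and does not affect scoring, the ranking of the remaining documents is the same as for the unfiltered query.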

Please find attached the updated Solr config files ( Note: I have changed the linked database to the demo database named 'sigmah_demo', under user 'sigmah', with password 'hamsig' )

2) Test the demo database.

Done. However, I still feel there are many null fields in the database, and I would be really glad if there were more text fields in tables like contact, org units etc that would make it easier for me to test.


3) Continue working on solr connection with solrj

The error from the previous week was a local java.net.ConnectException. On my Ubuntu 16.04 system, I first have to switch off the WiFi for the Solr connection to be established. For now, I have made sure that a pop-up signalling a "Failure to connect to Solr Server" shows up when a search is attempted without a connection. Also, the connection is established only after the first search string has been sent; I am working on that.

If this error occurs on other systems as well, I shall have to look deeper into the matter later on to resolve it.

4)  Introduce automatic indexing.

I have not worked on this, as I first thought it would be easier to implement a temporary button which can be manually clicked to index the database. Currently this is sending a command to the Solr Server to perform a full-import, and it is managing to do this successfully via an asynchronous method in the SearchService class.

Regarding delta imports, I have found that the Sigmah schema would need a 'Last Updated Time' column on each table that I index, so that it can be compared with the time of the last delta import.
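
The comparison itself is simple; the sketch below shows it on in-memory rows. The Row type and its fields are hypothetical stand-ins for a table with the proposed column, and this is not how the DataImportHandler is configured, only the selection rule it would apply.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DeltaSelector {

    // Hypothetical row: an id plus the proposed 'Last Updated Time' column.
    static final class Row {
        final int id;
        final Instant lastUpdated;

        Row(int id, Instant lastUpdated) {
            this.id = id;
            this.lastUpdated = lastUpdated;
        }
    }

    // Returns the ids of rows modified strictly after the previous import.
    // Rows without a timestamp are skipped here; a real import would have to
    // decide how to treat them.
    static List<Integer> changedSince(List<Row> rows, Instant lastImport) {
        List<Integer> changed = new ArrayList<>();
        for (Row row : rows) {
            if (row.lastUpdated != null && row.lastUpdated.isAfter(lastImport)) {
                changed.add(row.id);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Instant lastImport = Instant.parse("2017-06-10T00:00:00Z");
        List<Row> rows = Arrays.asList(
                new Row(1, Instant.parse("2017-06-09T12:00:00Z")),
                new Row(2, Instant.parse("2017-06-11T08:30:00Z")));
        System.out.println(changedSince(rows, lastImport)); // [2]
    }
}
```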

However, in case I am not able to add any such column to the schema of Sigmah due to complexity issues, it may be possible to do this with a workaround described here : 


However I am not very sure about this and would like your opinion on this.

5) update installation and upgrade instructions to include installation of Solr

- To be discussed.

6) Presentation of Results on the Screen 

Last week I had only managed to bring the queried Solr documents to the client side, which I confirmed using Window.alert messages. This week I managed to bring the results into the main application view. Now, on searching, the results show up in different tabs in a grid format ( Note that the strings themselves are not yet parsed and beautified; I will work on that later ).

Here are some screenshots:



A major pain in the neck was that once the first search result was displayed, the view instance froze for all other search instances/tabs/views.


In the end I realized that, with the version of GXT used in Sigmah (2.x), it is necessary to call Component.layout() after every modification made to the component. This caused quite some frustration and took some time to figure out.


Next week:

- On clicking a search result link, a project/contact/orgunit tab opens up ( minimal goal : restricted to projects )
- Perform filter on project/contact/orgunit queries from front-end
- Look into delta imports after discussing with mentor
- Auto index scheduling attempt
- Better presentation of the search results
- More goals after discussion with mentor

A gentle reminder that I'll be travelling from 17th - 26th.

Thank you!
Test_Sigmah.zip

indianauthority97

Jul 6, 2017, 7:23:15 AM
to Sigmah developement
Update on previous 2 weeks' progress ( 1 week delayed due to the aforementioned workshop ):

- On clicking a search result link, a project/contact/orgunit tab opens up

Achieved, demonstrated in the Skype meeting. However, for some search results, the permissions implemented were disallowing the tab from opening up, giving a "You are not authorized" popup, and the tab was freezing on the tab panel with a "Loading..." title.  

- Better presentation of the search results

As discussed, I have added an image to better represent a search result as a project/orgunit/contact. Here are some screenshots:
 





- Major changes in the solr xml files -

- I discovered a lot of errors I had made while configuring the Solr config files, most importantly that the table "partner" was not being used for storing the major chunk of information related to org units.
- I have also changed the text_general field type in schema.xml to include more filters for indexing and querying, for example:
  - Stopwords filters
  - Stemming ( both English and French )
  - Elision filters
  - ASCII character filters

The searching has improved a lot now and we can search for partial words, words without accents, and so on.
I am attaching the changed config files here.
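
To make the effect of two of these filters concrete, here is a rough plain-Java imitation of elision removal and accent folding. It is only an illustration of what happens to a token, not Solr's own filter implementations, and the class name and the list of elided prefixes are assumptions. Non-ASCII characters are written as Unicode escapes.

```java
import java.text.Normalizer;
import java.util.Locale;

final class TextFolding {

    // Strips a leading French elided form such as l', d', qu'
    // (straight or typographic apostrophe).
    static String stripElision(String word) {
        return word.replaceFirst("(?i)^(qu|[ldjmnstc])['\u2019]", "");
    }

    // Decomposes accented letters (NFD), drops the combining marks,
    // then lower-cases the result.
    static String foldToAscii(String word) {
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    static String normalize(String word) {
        return foldToAscii(stripElision(word));
    }

    public static void main(String[] args) {
        // "l'ecole" with e-acute becomes "ecole"
        System.out.println(normalize("l'\u00e9cole"));
        // "Education" with capital E-acute becomes "education"
        System.out.println(normalize("\u00c9ducation"));
    }
}
```

Applying the same normalization at index time and at query time is what lets a query typed without accents match accented text.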

- Perform filter on project/contact/org unit queries from front-end

Achieved, demonstrated in the Skype meeting.

- Look into delta imports - TBD
- Auto index scheduling attempt - TBD

This week, I have been looking into the permissions aspect. I first attempted early binding: filtering the results on the server side, i.e. using the existing permissions of a user to fetch the projects they have access to, comparing these with the Solr results, and filtering out the results that do not match. However, this failed because some of the functions used to fetch the DAO from the database were deprecated.

Thus, I fell back on late binding ( filtering on the client side just before the final presentation of results ). Now, I fetch the IDs of the projects/org units a user has access to using an asynchronous command ( already existing in the codebase ) and then compare these IDs with the IDs of the projects/org units fetched by the Solr query. So far I have succeeded for Projects and OrgUnits. However, for lack of database entries I am not able to test properly, and will have to create more dummy organizational units and projects.
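
The comparison step amounts to intersecting two collections of IDs while keeping Solr's ranking order. A minimal sketch, with hypothetical names and plain integers standing in for the project/org unit IDs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PermissionFilter {

    // Keeps only the result ids the user may access, in the order Solr returned them.
    static List<Integer> filter(List<Integer> solrResultIds, Set<Integer> accessibleIds) {
        List<Integer> visible = new ArrayList<>();
        for (Integer id : solrResultIds) {
            if (accessibleIds.contains(id)) {
                visible.add(id);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Integer> fromSolr = Arrays.asList(7, 3, 9, 4);
        Set<Integer> allowed = new HashSet<>(Arrays.asList(3, 4, 5));
        System.out.println(filter(fromSolr, allowed)); // [3, 4]
    }
}
```

One consequence of filtering after the query is that a page of results can shrink, or become empty, once the inaccessible hits are removed.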

Another question I want to ask is: can a user view deleted projects/org units?

Goals for this week/coming weeks:

- Above mentioned ones incomplete from last week
- Facets ( Highlighting of Search Results )
- Auto-Suggestions and Relevance Ranking
- Canned queries
- Look into the files aspect, using Apache Tika

I'd like to thank my mentors for reviewing my work and giving valuable feedback in the first evaluation so that I can continue working with focus! :)

Regards,
Aditya.
Test_Sigmah.zip

indianauthority97

Jul 24, 2017, 7:55:04 AM
to Sigmah developement
Update on previous 2 weeks' progress:

- Contacts permission filtering: Done.


- Added a property in the pom.xml for the sigmah core URL:

I had earlier hard-coded this, but it can now be set from the front-end.


- Automatic indexing: Done.


The admin ( who has access to the settings tab in the admin console ) has to load their dashboard once, and the automatic indexing will start.

- Work on automation and documentation of installing and running the search engine:

Organizations have 2 options:

1. They run their own Solr server. I have written partially complete documentation for installing Solr and the config files here: http://www.sigmah.org/issues/view.php?id=535, which I hope you have been able to replicate on your systems.

2. They connect to a central Sigmah server. I have thought of this as follows: the central server runs a single Solr instance with multiple cores. Each core has a URL associated with it, for example on my local system, 

Each core is associated with an organization. The job of the admins at the Sigmah central server is to set up this core ( all the steps needed to create a new core in Solr and configure the XML files for it ) and provide its URL to the admins of the organization. ( I assume the shared databases also reside on the central server. ) The admins of the organization then simply connect to that URL from the front-end, and can perform manual indexing as well as auto-indexing.

With this in mind, I have built the following UI in the admin console
( Note that whoever has permission to view this settings tab in the admin console has access to the solr settings ):



The admin of the org enters the URL in the textbox and saves the changes; the URL is then stored in the database in an additional column of the "Organization" table. I migrated the previously used "DIH" button to a "Manual Index" button in this panel ( still need to bring it to the centre though :P )


Problems/Doubts:
1. I connect to my local Solr server over http. However, I am not sure whether http is also used when a shared instance connects to the central server ( to the database, for instance ), given the security concerns. If https is used, there may be some trouble connecting the shared instance to the central Solr server.
2. Other network and connectivity issues: since I can only test this on my local system, I cannot rule out other issues such as data loss and connection timeouts.
3. There is more work for a Sigmah Central admin, who has to set up a core for each organization.

- Work on the files aspect of search:

I have tested out the indexing of files on a pilot scale in another project, and it works. However it has been hard to figure out how to do this from the sigmah codebase, and also to fetch the results of a query (i.e a downloadable link in the search results), and I am currently tackling all these. 

- Facets ( Highlighting of Search Results ), Auto-Suggestions and Relevance Ranking, Canned queries :- TBD, I have given higher priority to files and improving search.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

indianauthority97

Aug 13, 2017, 10:51:25 AM
to Sigmah developement
Hey everyone!

This is probably the last report I'll be writing for my GSoC project. Can't believe 3 months are over already!

Update on previous 2 weeks' progress:

1 ) Completed the files aspect of searching

-Indexing of files is working properly, as far as I tested.


-Presentation of files search results:




Each result has a title, an author field, an "attachment" image to distinguish it from other types of results, and the first few hundred characters of the text.
On clicking a file search result, the file downloads to the machine ( same behaviour as in the "Reports and Documents" section ).

-Permissions filtering of files: I have not been able to achieve this to a satisfactory extent. As of now, I am unable to implement a better filter, so I have done a 'rudimentary' one:

By "rudimentary" file results filtering, I mean that only those files of which the currently logged-in user is the author are shown in the search results. On clicking these results, the file downloads successfully. Due to the complexity of the code, this is the best permission check I am able to implement for now ( and it is partially useful at the same time ). In future, a better permission check has to be implemented.


To make this clear to the user I have renamed the option "Files" to "Your Files" in the dropdown list in the search bar.

I also completely re-structured the Solr automatic indexing I had written earlier to use the existing command pattern within Sigmah codebase, as well as include indexing of files to the existing full data import facility.


2) Made multiple changes to UI 

-Completely changed the search bar to use GXT elements instead of GWT, giving it a better look and feel.

-The title of all the search results is now in the brown that Sigmah uses ( as opposed to the earlier blue ).

-I also removed the "FIndex" button I had put in for testing purposes.

-Added a label about the automatic indexing in the Sigmah Solr settings part, above the "Manual index" button

-Got the Manual index button to appear in the centre of the panel.

- Got the icon on the same row of the search result

- Also added a lot of fields to show to the user in the search results of Projects, Contacts and Orgunits.




- Added an image to the search button instead of the earlier "Go" text.


3) Added error messages for many different cases:

- On logging in, if there is no connection to the Solr server 
- When trying to carry out search without connection to the Solr Server
- After updating the Solr core URL in the admin, if a connection cannot be established between Sigmah and the Solr server, or if the URL entered is bad. ( If it connects successfully, it gives a message too. )
- When there are no search results or the user does not have permissions to view the search results.
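
For the "bad URL" case, a syntactic check can run before any connection attempt. The sketch below is one possible rule set (http or https scheme, non-empty host) using java.net.URI; the class name and the exact rules are assumptions, not the checks implemented in the pull request.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class CoreUrlValidator {

    // Accepts only syntactically valid http/https URLs that have a host.
    static boolean isValidCoreUrl(String value) {
        if (value == null || value.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(value.trim());
            String scheme = uri.getScheme();
            boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return http && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidCoreUrl("http://localhost:8983/solr/Test_Sigmah")); // true
        System.out.println(isValidCoreUrl("localhost:8983/solr"));                    // false
    }
}
```

A URL that passes this check can still point at a server that is down, so the connection error message is needed in addition to it.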


4) Cleaned up code, added comments


For some reason, there seems to be a JDK version mismatch on Travis, i.e. I am compiling with JDK 1.8 and the builder is running the tests with JDK 1.6 or 1.7: https://stackoverflow.com/questions/23249331/java-unsupported-major-minor-version-52-0

That is the reason the build is not able to pass the tests.

I have even tried with JDK 1.7 on my machine. However, the issue seems to be in the tests only and not in the compilation. It could be that an old version of jdk was used for writing those tests.
When skipping the tests, it compiles successfully. 
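
The "52.0" in that error is the class file version: major version 52 corresponds to Java 8, 51 to Java 7 and 50 to Java 6. The sketch below reads the major version from a class file header, which can help find which compiled classes or jars were built for a newer JDK than the one running the tests. The class name is hypothetical and the header bytes in main are made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ClassFileVersion {

    // Reads the major version from a class file header:
    // magic number (4 bytes), minor version (2 bytes), major version (2 bytes).
    static int majorVersion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("Not a Java class file");
        }
        data.readUnsignedShort();        // minor version, ignored
        return data.readUnsignedShort(); // major version
    }

    // Maps a major version to the Java release number (52 -> 8, 51 -> 7, 50 -> 6).
    static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) throws IOException {
        // Header of a class compiled for Java 8 (major version 52 = 0x34).
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        int major = majorVersion(new ByteArrayInputStream(header));
        System.out.println("major " + major + " = Java " + javaRelease(major));
    }
}
```

Passing a `FileInputStream` for a .class file from the build output instead of the byte array gives the version that file was actually compiled for.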

Also, my codacy errors (https://www.codacy.com/app/osarrat/sigmah/pullRequest?prid=662493) are very minor and I have resolved almost all of them. I hope my code is worthy of merging now.

5) Added a global permission SEARCH to the profiles; a user can search only if it is ticked ( default: unticked ):


6) Added a wiki in the Sigmah administrator guide at the bottom of the page:


In the final week, I plan to write the technical documentation and do a few final brush-ups to my code, and a final report for Google.

I hope my work has been satisfactory and really wish to see Sigmah with my search feature included as soon as possible!

Thanks and regards,
Aditya