Let’s say we want to add a full-text search engine to your application. Apache Solr is a popular open-source choice. In this post I will share a simple architecture for synchronizing data between the primary data store (MySQL, MongoDB, etc.) and Solr.
There are two basic problems we face when we need to synchronize data between two separate data stores:
- Propagating changes into Solr when the primary data changes (via add, update, delete, etc.)
- Rebuilding the whole Solr index. This is required when you first build the index for existing data, or when the type of a search field changes (which changes how the field is indexed and searched).
Solving the first problem:
There are many ways to do this. For monolithic applications I prefer an event-based solution. You can build a simple event framework yourself or use a library like Reactive Extensions (RxJava). The idea is to publish an event whenever the primary data changes, and to have a listener that reacts to those events and updates the Solr store. There is a very good reason not to update Solr in the same execution flow/thread as the primary change: a slow or failing Solr update should never block or break the primary write. I always try to think in asynchronous flows.
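The event-based approach above can be sketched with a tiny, self-contained event bus using only the JDK (RxJava's `PublishSubject` would play the same role). The `ChangeEvent` type, the bus, and the counter-based Solr stub are all illustrative names, not part of any real API; the key point is that `publish` hands the work to a background thread, so the primary write path returns immediately.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// A change event for a single record: the record id and what happened to it.
record ChangeEvent(String id, String type) {} // type: "add", "update", "delete"

// Minimal event bus: publishers hand in events, listeners are notified
// on a background thread so the primary write path is never blocked.
class ChangeEventBus {
    private final List<Consumer<ChangeEvent>> listeners = new CopyOnWriteArrayList<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    void subscribe(Consumer<ChangeEvent> listener) {
        listeners.add(listener);
    }

    void publish(ChangeEvent event) {
        // Dispatch asynchronously: the caller (the primary-store write) returns immediately.
        worker.submit(() -> listeners.forEach(l -> l.accept(event)));
    }

    void shutdown() throws InterruptedException {
        worker.shutdown();
        worker.awaitTermination(5, TimeUnit.SECONDS);
    }
}

public class Demo {
    public static void main(String[] args) throws Exception {
        ChangeEventBus bus = new ChangeEventBus();
        AtomicInteger solrUpdates = new AtomicInteger();

        // Listener that would push each change to Solr
        // (stubbed here with a counter instead of a real Solr client).
        bus.subscribe(e -> solrUpdates.incrementAndGet());

        bus.publish(new ChangeEvent("42", "update"));
        bus.publish(new ChangeEvent("43", "delete"));
        bus.shutdown();

        System.out.println(solrUpdates.get()); // 2
    }
}
```

In a real application the listener would call the Solr client (e.g. SolrJ) and add retry/error handling, since the asynchronous hop means a failed Solr update must be handled separately from the primary write.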
The second problem:
This one is inherently more resource-intensive. If you have a huge amount of data, you will need to break the synchronization up into smaller pieces. There are many ways to do this; one of the simplest is a scheduled job that processes one batch of data per run. I normally sync 100-1000 records or documents per batch. You can also restrict the schedule to off-peak hours so the job does not tie up resources during busy times.
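A minimal sketch of that batched rebuild, assuming a primary store that can page through documents by offset (the `PrimaryStore` class and `runOnce` method are hypothetical names, and the real Solr indexing call is left as a stub):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in primary store: 2,500 documents identified by sequential ids.
class PrimaryStore {
    private final int total;
    PrimaryStore(int total) { this.total = total; }

    // Fetch one ordered page of ids, starting at `offset`.
    List<Integer> fetchBatch(int offset, int batchSize) {
        List<Integer> batch = new ArrayList<>();
        for (int id = offset; id < Math.min(offset + batchSize, total); id++) {
            batch.add(id);
        }
        return batch;
    }
}

public class Reindexer {
    static final int BATCH_SIZE = 500; // within the 100-1000 range from the text

    // One "tick" of the scheduled job: index the next batch, return the new offset.
    static int runOnce(PrimaryStore store, int offset) {
        List<Integer> batch = store.fetchBatch(offset, BATCH_SIZE);
        // indexIntoSolr(batch); // real Solr call would go here
        return offset + batch.size();
    }

    public static void main(String[] args) {
        PrimaryStore store = new PrimaryStore(2500);
        int offset = 0;
        int ticks = 0;
        // In production the loop body would run from a ScheduledExecutorService,
        // restricted to off-peak hours; here we just drive it to completion.
        while (true) {
            int next = runOnce(store, offset);
            if (next == offset) break; // nothing left to index
            offset = next;
            ticks++;
        }
        System.out.println(ticks + " " + offset); // 5 2500
    }
}
```

Because each run picks up where the last one stopped, the job can be interrupted (or paused during busy hours) and simply resume from the saved offset on the next tick.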
For my recent MongoDB- and Solr-based document search engine, built with Spring and RxJava, I used a slightly more sophisticated version of this solution.