Based in Milan, Italy, Diary of a Digital Engineer is a blog by Stefano Sandrini. Here I post about my journey as a software engineer and solution architect, new ideas, thoughts and quotes by influential architects, engineers, and digital innovators.

AWS Search Engines

In this post I give a brief introduction to the two search engine solutions available on AWS: Elasticsearch and CloudSearch.

I found myself having to choose between these two technologies in 80% of our projects, and I thought it would be nice to share my humble opinion and experience.

Elasticsearch is an open-source, independent project developed by Elastic (elastic.co). It's famous and widely adopted: statistics say Elasticsearch is just behind Solr in terms of adoption as a search engine. If you go with Elasticsearch, you will have a wide community of developers and adopters and very good documentation. The service also comes with Logstash and Kibana.

CloudSearch is a product developed by AWS itself and it offers a fully managed solution, which sounds great. You can choose, for example, between single or multi-AZ replication and you don't have to think about software updates: it's all delegated to AWS.

Which is the best? I think the answer is ... it depends :) It's all about your real needs, the performance you need (on search or on updating objects) and whether you want more or less freedom to manage your search instances.

Let's have a closer look.

Massive data import / Export

Elasticsearch had the ability to import data with a feature called "river" in versions before 2.0; now you can use integrations and importers.

CloudSearch can import data in JSON or XML format through "batch imports", or from a file on S3 (by setting a valid path).
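To give an idea of what a JSON batch looks like, here is a minimal sketch that builds one. It assumes the CloudSearch document batch layout (a JSON array of "add" and "delete" operations); the `build_batch` helper, the document IDs and the field names are hypothetical and only there for illustration.

```python
import json


def build_batch(documents, deleted_ids=()):
    """Return a CloudSearch-style document batch as a JSON string.

    documents: dict mapping a document ID to a dict of field values.
    deleted_ids: IDs of documents to remove from the index.
    """
    operations = [
        {"type": "add", "id": doc_id, "fields": fields}
        for doc_id, fields in documents.items()
    ]
    operations += [{"type": "delete", "id": doc_id} for doc_id in deleted_ids]
    return json.dumps(operations)


# Hypothetical documents, just to show the shape of the batch.
batch = build_batch(
    {"post-1": {"title": "AWS Search Engines", "author": "Stefano"}},
    deleted_ids=["post-0"],
)
print(batch)
```

The resulting string is what you would upload to the domain's document endpoint, or save to a file for a later import.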

Data Backup and Restore

Elasticsearch on AWS takes daily automated snapshots of the primary index shards in a domain. You can find more info about it in the AWS page "Configuring Snapshots". The downside is that you must contact the AWS Support team to restore an Amazon ES domain with an automated snapshot. Anyway, as AWS reports in its documentation, "If you need greater flexibility, you can take snapshots manually and manage them in a snapshot repository, an Amazon S3 bucket".

CloudSearch takes care of the whole backup process, and the same goes for the restore activity.
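As a rough sketch of the manual route, the snippet below only builds the two REST calls involved: registering an S3 bucket as a snapshot repository and then taking a snapshot into it. It assumes the standard Elasticsearch `_snapshot` API; the repository name, bucket, region and role ARN are hypothetical placeholders, and on AWS the requests would also need to be signed with credentials that are allowed to use that role.

```python
import json


def snapshot_repository_request(repo_name, bucket, region, role_arn):
    """Return (method, path, body) to register an S3 snapshot repository."""
    body = {
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    }
    return "PUT", f"/_snapshot/{repo_name}", json.dumps(body)


def manual_snapshot_request(repo_name, snapshot_name):
    """Return (method, path) to take a manual snapshot into a repository."""
    return "PUT", f"/_snapshot/{repo_name}/{snapshot_name}"


# Hypothetical values, for illustration only.
register = snapshot_repository_request(
    "my-repo",
    bucket="my-snapshot-bucket",
    region="eu-west-1",
    role_arn="arn:aws:iam::123456789012:role/MySnapshotRole",
)
snapshot = manual_snapshot_request("my-repo", "snapshot-2016-01-01")
print(register)
print(snapshot)
```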

Cluster Management, scaling options and High Availability

With the Elasticsearch service on AWS you can define the cluster's configuration via the web dashboard or the CLI. You can configure the number of nodes and the instance type. You can also enable the "zone awareness" flag, so your instances are spread across two Availability Zones. If you enable this feature, however, you should use the Elasticsearch APIs to set up replicas for your indexes. Last, but not least, you can configure the storage options (storage type, volume type and volume size). On Elasticsearch, indexes are split into several "shards", and replica copies of those shards provide redundancy: when a node fails, the replicas are used to replace the lost data.

On CloudSearch, the cluster is entirely managed by AWS from both the "scaling" and the "availability" perspective. For example, when a node reaches its threshold, it is automatically upgraded to the next larger instance type. Also, when the capacity goes beyond the largest available instance type, the index is partitioned across multiple instances.

Like Elasticsearch, CloudSearch offers a "multi-AZ" deployment option for HA and AZ failover. It is also important to remember that CloudSearch is fully integrated with the CloudWatch monitoring tools.
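Setting up replicas through the Elasticsearch APIs boils down to an index settings update. The sketch below only builds that request, assuming the standard `_settings` endpoint and the `number_of_replicas` setting; the helper name and the index name "posts" are hypothetical.

```python
import json


def replica_settings_request(index_name, replicas=1):
    """Return (method, path, body) to set the replica count of an index."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"/{index_name}/_settings", json.dumps(body)


# One replica per shard, so a copy can live in the other Availability Zone.
request = replica_settings_request("posts", replicas=1)
print(request)
```

With zone awareness enabled, at least one replica is what lets a copy of each shard sit in the second Availability Zone.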

Search, Indexing and integration 

Elasticsearch offers several libraries and clients, but I usually perform integration via the RESTful APIs. I think they are pretty exhaustive and you can do almost everything you need, from searching to indexing.
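To show how little is needed to search through the RESTful APIs, here is a minimal sketch that builds a full-text query. It assumes the standard `_search` endpoint and a `match` query; the helper name, the index "posts" and the field "title" are hypothetical.

```python
import json


def match_search_request(index_name, field, text, size=10):
    """Return (method, path, body) for a simple full-text match query."""
    body = {"query": {"match": {field: text}}, "size": size}
    return "POST", f"/{index_name}/_search", json.dumps(body)


# Hypothetical index and field names.
request = match_search_request("posts", "title", "search engines", size=5)
print(request)
```

The same tuple can be handed to any HTTP client, which is why I rarely feel the need for a dedicated library.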

CloudSearch exposes RESTful APIs and many SDKs (for Java, PHP and Node.js).
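On the CloudSearch side, a search is a plain GET request against the domain's search endpoint. The sketch below builds such a URL, assuming the 2013-01-01 API version and the `q`, `size` and `start` parameters; the endpoint host is a made-up placeholder, since the real one is assigned by AWS when the domain is created.

```python
from urllib.parse import urlencode

API_VERSION = "2013-01-01"  # assumed CloudSearch API version


def cloudsearch_search_url(search_endpoint, query, size=10, start=0):
    """Return the search URL for a CloudSearch domain endpoint."""
    params = urlencode({"q": query, "size": size, "start": start})
    return f"https://{search_endpoint}/{API_VERSION}/search?{params}"


# Hypothetical endpoint, for illustration only.
url = cloudsearch_search_url(
    "search-mydomain-example.eu-west-1.cloudsearch.amazonaws.com",
    "search engines",
)
print(url)
```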

Cost

The pricing model differs between the two solutions. You pay for Elasticsearch based on instance type and monthly usage, storage type and size, and monthly data transfer. CloudSearch has a pricing plan based on instance size, batch uploads (billed per 1,000 requests), number of re-indexing requests and data transfer out.
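To compare the two models on your own numbers, a back-of-the-envelope calculation like the following can help. It simply adds up the cost components listed above; every rate in the example is a made-up placeholder, not a real AWS price, so plug in the values from the current pricing pages.

```python
def elasticsearch_monthly_cost(instance_hours, hourly_rate,
                               storage_gb, storage_rate_per_gb,
                               transfer_gb, transfer_rate_per_gb):
    """Sum of instance usage, storage and data transfer."""
    return (instance_hours * hourly_rate
            + storage_gb * storage_rate_per_gb
            + transfer_gb * transfer_rate_per_gb)


def cloudsearch_monthly_cost(instance_hours, hourly_rate,
                             batch_uploads, rate_per_1000_uploads,
                             reindex_requests, rate_per_reindex,
                             transfer_out_gb, transfer_rate_per_gb):
    """Sum of instance usage, batch uploads, re-indexing and transfer out."""
    return (instance_hours * hourly_rate
            + (batch_uploads / 1000) * rate_per_1000_uploads
            + reindex_requests * rate_per_reindex
            + transfer_out_gb * transfer_rate_per_gb)


# All rates below are hypothetical placeholders, NOT actual AWS prices.
es_cost = elasticsearch_monthly_cost(720, 0.10, 100, 0.10, 50, 0.09)
cs_cost = cloudsearch_monthly_cost(720, 0.10, 20000, 0.10, 4, 1.00, 50, 0.09)
print(f"Elasticsearch: {es_cost:.2f}  CloudSearch: {cs_cost:.2f}")
```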

Final considerations

My guideline is to choose Elasticsearch when you need more freedom and flexibility and want to take advantage of a large community of developers and integrators.

On the other hand, choosing CloudSearch is a complete no-brainer if you need top operational efficiency, thanks to its fully managed offering.
