The client - a real estate investment firm - is involved in a broad spectrum of property-related projects, most of which rely on custom-built software. Among other things, the company needs to collect, process and analyse significant amounts of data gathered from real estate markets.
One of the client’s existing Apache Spark based systems, used for processing complex, high-volume geo-located data sets, was generating significant storage and processing costs. The goal of the project was to redesign the system architecture so that those costs could be substantially reduced. More specifically, the existing solution was optimized by pre-processing the data (grouping it by market polygon) and storing the results in HDFS. The client requested a new feature that required a different grouping strategy. Storage costs were already high, and with the new grouping strategy they could easily double. On top of that, the new approach could also increase the time needed to reindex all the grouped data in HDFS - a step that was required whenever a market polygon was adjusted, which happened from time to time and was beyond the client’s control.
All in all, the new solution architecture had to ensure efficient data processing and storage and, at the same time, deliver significant cost savings.
With these requirements in mind, we proposed designing and implementing a more flexible and cost-efficient solution based on Apache Kafka. Kafka acts as a single source of truth not only for the backend but also for the front-end part of the system (implemented by a third party in RoR and React.js).
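The original write-up does not include configuration details, so the snippet below is only a minimal sketch of how a source-of-truth topic could be set up, assuming Kafka's AdminClient and a hypothetical compacted `properties` topic; the topic name, partition count and replication factor are illustrative assumptions, not the client's actual settings. Log compaction retains the latest event per key, so downstream consumers can rebuild their state by replaying the topic.

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.jdk.CollectionConverters._

object CreateSourceOfTruthTopic extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val admin = AdminClient.create(props)

  // Hypothetical "properties" topic; log compaction keeps the latest record per key
  // (e.g. per property id), so the topic can act as a durable source of truth
  // that other services replay to build their own views of the data.
  val topic = new NewTopic("properties", 12, 3.toShort)
    .configs(Map("cleanup.policy" -> "compact").asJava)

  admin.createTopics(Collections.singletonList(topic)).all().get()
  admin.close()
}
```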
In short, all the data collected from different services is fed into Kafka; more specifically, a set of dedicated microservices processes the data, serializes it to the Avro format and then GZIPs it to reduce its size. Yet another service materializes part of the data to Elasticsearch, where the RoR application can make use of it. The reindexing issue was solved by producing specific events to Kafka - each event informs the indexing processors about a market update or removal, or about a property addition, geolocation update or removal. This modification reduced the need for reindexing to a minimum, with the process currently running in near real time.
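As an illustration only, the following Scala sketch shows what one of the producing microservices could look like. The `MarketUpdated` Avro schema, its field names and the `market-events` topic are hypothetical; here GZIP is applied by the producer via `compression.type`, whereas the client's services may compress payloads explicitly instead.

```scala
import java.io.ByteArrayOutputStream
import java.util.Properties

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

object MarketEventProducer extends App {

  // Hypothetical Avro schema for a "market updated" event; field names are illustrative.
  private val schema: Schema = new Schema.Parser().parse(
    """{
      |  "type": "record",
      |  "name": "MarketUpdated",
      |  "fields": [
      |    {"name": "marketId",   "type": "string"},
      |    {"name": "polygonWkt", "type": "string"},
      |    {"name": "updatedAt",  "type": "long"}
      |  ]
      |}""".stripMargin)

  // Serialize a GenericRecord to Avro binary. Compression is handled by the
  // producer (compression.type=gzip), so no manual GZIP step is needed here.
  private def toAvroBytes(record: GenericRecord): Array[Byte] = {
    val out     = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  private val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)
  props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip") // GZIP batches on the wire and on disk

  private val producer = new KafkaProducer[String, Array[Byte]](props)

  val event = new GenericData.Record(schema)
  event.put("marketId", "market-42")
  event.put("polygonWkt", "POLYGON((0 0, 1 0, 1 1, 0 0))")
  event.put("updatedAt", System.currentTimeMillis())

  // Keying by market id keeps all events for one market in the same partition, in order.
  producer.send(new ProducerRecord("market-events", "market-42", toAvroBytes(event)))
  producer.flush()
  producer.close()
}
```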
The biggest advantage is that the system can reuse all the SQL queries already used by the RoR application. This was achieved by programming a pool of workers (Kafka consumers) that run on AWS spot instances. Based on the events listed earlier, the workers filter Kafka topics (each holding about 400M records) and select only the subset of data that needs to be processed, rather than reindexing the whole dataset. In the next step, the relevant SQL queries are executed against the selected subset. The Kafka-consumer-based approach reduced infrastructure costs by close to 90%.
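A minimal sketch of one such worker is shown below, reusing the hypothetical `market-events` topic and Avro schema from the previous snippet; the actual filtering of the data topics and the execution of the SQL queries are reduced to a placeholder.

```scala
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import scala.jdk.CollectionConverters._

object ReindexWorker extends App {

  // Same hypothetical schema as on the producer side.
  private val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"MarketUpdated","fields":[
      |  {"name":"marketId","type":"string"},
      |  {"name":"polygonWkt","type":"string"},
      |  {"name":"updatedAt","type":"long"}
      |]}""".stripMargin)

  private val reader = new GenericDatumReader[GenericRecord](schema)

  private def fromAvroBytes(bytes: Array[Byte]): GenericRecord =
    reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null))

  private val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "reindex-workers") // one group = a pool sharing partitions
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[ByteArrayDeserializer].getName)

  private val consumer = new KafkaConsumer[String, Array[Byte]](props)
  consumer.subscribe(Collections.singletonList("market-events"))

  while (true) {
    val records = consumer.poll(Duration.ofSeconds(1)).asScala

    // Collect the markets touched by this batch of events; only their data needs
    // to be selected and run through the existing SQL queries, instead of
    // reindexing the full dataset.
    val affectedMarkets =
      records.map(r => fromAvroBytes(r.value()).get("marketId").toString).toSet

    affectedMarkets.foreach { marketId =>
      // Placeholder: select the subset of records for this market and
      // execute the reused SQL queries against it.
      println(s"Reprocessing market $marketId")
    }
  }
}
```

Running such workers in a single consumer group also fits the spot-instance setup: if an instance is reclaimed, Kafka rebalances its partitions across the remaining workers and processing continues.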
In business terms, the Apache Kafka based modification of the system significantly reduced report generation time, automated a process previously handled by analysts (improving the efficiency of their work), substantially decreased the costs of data processing and storage, and - last but not least - laid an architectural foundation for further development of the client’s big data processing platform.