profile picture
Aman Mangal
Software Engineer at Dgraph Labs
Koramangala, Bengaluru, KA, India

hangout:   mangalaman93
Email:   ⟨firstname⟩

Brief Biography

I am currently working at Dgraph Labs as a Software Engineer with focus on building High Performance and Distributed Systems.
Please find my resume here, checkout my github profile.

Professional Experience

Software Engineer, Zeotap, Bengaluru, KA, India
My focus at Zeotap has been on building, scaling, optimizing and stabilizing various data pipelines and systems. Zeotap deals with billions of records everyday and requires processing terabytes of data fast and efficiently. I built various components of our data pipelines using Spark and Oozie while ensuring consistency of data even when jobs fail. We optimized one of the layers in our data pipeline and reduced its run time from 4 hours to only 20 minutes by correctly identifying the bottleneck. I also worked on setting up and optimizing Redshift, to enable creating segments and run adhoc analysis queries efficiently. I migrated one of our data stores as well as all the data pipelines around it, from Aurora to Aerospike improving performance from 10,000 QPS to more than 300,000 QPS.

Apart from this, I have spent my time on building and implementing algorithms for attributing revenue back to our data partners. I was able to optimize the algorithm so that it could finish within desired SLA. I also built a system that can estimate size of adhoc segments on billions of records with sub-second latency. This required building a customized in-memory database specifically designed to run queries that can handle flat as well as nested data. The system was implemented using Go for achieving efficiency while ensuing high uptime using fail fast and supervisor tree ideas from Erlang programming language.

I also helped in building a collaborative culture in our Data Engineering team. I started peer code review process on Github to improve the code quality and increase collaboration within the team. To improve communication across teams, I stressed upon the idea of internal tech-talks. I wrote the first engineering blog post at Zeotap to increase exposure outside Zeotap and encouraged everyone in the team to write so that we can better realize and communicate our ideas.

Data Engineer, Lumiata Inc, Mumbai, MH, India
At Lumiata, I worked on scaling the existing Spark cluster to possibly hundreds of nodes by containerizing all the services using docker and performing manual orchestration using docker-compose. I also setup a secure docker registry server and a UI server to browse through docker images. Before the India office was closed down, I was looking into setting up a Kubernetes cluster for running multi node Spark clusters.

Member of Technical Staff Intern, Zerostack, Mountain View, CA, USA
I implemented support for adding external storage to ZeroStack using OpenStack Cinder drivers whish is extensible to future Cinder drivers with minimal changes to ZeroStack. The operation was atomic and followed eventual consistency model as well as it could recover from unexpected crash or power failures. Towards the end of my internship, I also helped in automation of setting up kubernetes clusters for customers.

Summer Intern, Bell Labs, Murray Hill, NJ, USA
We developed a monitoring infrastructure for Linux containers running on a cluster of physical servers using collectd, influxdb (a time series database) and grafana (for visualization of various metrics). We, then, compared various load prediction and container packing algorithms used for server consolidation in a cloud environment, using the technique of Discrete Event Time Simulation. We picked a specific set of containers streaming an arbitrary number of pixel streams used inside Bell Labs and analyzed their workload. To host a large number of such containers on a minimum set of physical servers while still maintaining an acceptable level of user response time, we needed to compare algorithms for load prediction and container packing. We, therefore, developed a Discrete Event Simulation infrastructure and showed that the simulation workload is close to real workload.
code view related paper


Research Experience

Raft Consensus Algorithm for Data Center Network
Data centers are becoming increasingly popular for their flexibility and processing capabilities. They are managed by a single entity and used by thousands of users. Centralized administration makes them extensible, predictable and more reliable, making them a good environment for innovative distributed applications. The flexibility and stronger guarantees provided by the data center network allow us to co-design distributed systems with their network layer and optimize their overall performance. In this project, we have analyzed the performance of the Raft consensus algorithm in the presence of heavy network load. We show that the degraded system performance can be significantly improved by co-designing the system with the network layer. We evaluate our idea on a 5 node cluster of etcd key-value store, setup on banana pi boards connected in a star topology. We also show how we can expose the network layer improvements to the application in a transparent manner in order to simplify application development.
code view source code report view report

Orchestration of Network Functions in Cloud (NFV)
The Recent emergence of Network Function Virtualization (NFV) technology has provided greater flexibility for applications running in the cloud in terms of placement, scalability etc. At the same time, it has led to a rising need of sophisticated mechanisms and orchestration tools to achieve bounded latency and throughput due to buffering in software queues, slow processing compared to hardware network functions (NF), software context switching and unpredictable software behaviours due to cache etc. The goal is to develop tools to provide given Quality of Service (QoS) which may be defined either in terms of end to end latency, response time or throughput in the presence of software NFs while using the placement, resource allocation, flow routing as control parameters. We have developed an infrastructure to run network functions as docker containers, networking is setup using OpenStack, nova-docker and openvswitch and routes are setup using an SDN controller such as OpendayLight.
code view source code

Optimizing Short Read Error Correction on GPU
Next generation sequencing (NGS) technologies have revolutionized the way of analyzing genetic information. It has eliminated the limitations of previous techniques in terms of throughput, speed, scalability and resolution. The effectiveness of NGS technologies is further improved by using error correction methodology on the imperfect though redundant sequence reads. In this project, we have utilized our HPC expertise gained in the CSE 6230 class and improved the run-time of GPU based solution (CUDA-EC) to this problem. We propose a parallel implementation for the same algorithm which reduces the kernel execution time by approximately 14.81% which we have achieved by increasing warp efficiency from 3.5% to 67.5% and a few other code optimization. We also show that multiple-threads-one-read model has enough task parallelism to exploit on GPU compared to classical approach of one-thread one-read model.
code view source code report view report

Fault Tolerance in Software Defined Networks
The Software Defined Networking paradigm introduces the concept of separation of data plane (forwarding plane) and control plane as well as centralization of algorithms that make routing decisions. Centralization greatly simplifies management of network by simplifying policy specification, modification, implementation and verification etc. Although, significant research has been conducted on SDN in recent years, there has not been enough discussion on how to deal with age-old yet common problem of network failures. Centralization also introduces more points of failures such as controller failures in the network. In this paper, we consolidate existing work in the domain of fault tolerance in software defined networks. We, then, define the problem of link failures and the constraints we faced while solving it. We propose an algorithm for proactively handling link failures. We have implemented our solution using pyretic (a SDN controller).
code view source code

DEBS Grand Challenge 2014
Smart grids are becoming ubiquitous today with proliferation of easy to install power generation schemes for Solar and Wind energy. The goal of consuming energy generated locally instead of transmitting it over large distances calls for systems that can process millions of events being generated from smart plugs and power generation sources in near real time. The heart of these systems often is a module that can process dense power consumption event streams and predict the consumption patterns at specific occupational units such as a house or a building. It is also often useful to identify outliers that are consuming power significantly higher than other similar devices in the occupational unit (such as a block or a neighborhood). In this paper, we present a system that can process over a million events per second from smart plugs and correlate the information to output both accurate predictions as well as identify outliers. Our system is built from the ground up in C++ rather than using sophisticated complex event processing engines or any other commercial product which is why we believe we have achieved very high throughputs with very low CPU capacity for processing the events. Our results show that the throughput of our system is over a million events per second while using under 20% of one core.
code view source code paper view paper

Erlang Distributed File System
Erlang is recently developed general purpose highly concurrent functional programming language. It was designed to support distributed, fault tolerant, scalable, non-stop applications. It has been used in production systems (e.g. AXD301 ATM switch) with an uptime percentage of 99.9999999% (nine nine's). It is being used by Facebook, Github, Riak, Amazon etc. to develop large distributed systems. We have leveraged the distributed capabilities of Erlang and have developed yet another DFS namely Erlang Distributed File System (eDFS). It is highly concurrent, reliable, scalable and fault tolerant. In this report, we first develop small taxonomy of DFSs. We, then describe the architecture of eDFS and compare the design with the existing DFSs.
code view source code documentation view report

A Long-Term Profit Based Approach For Dynamic Server Consolidation in Virtual Machine Clusters
Dynamic Server Consolidation (DSC) is a technique used by cloud owners to consolidate multiple virtual machines (VM) that run client application components, onto a minimal number of physical machines (PM). Dynamic change in the VM-to-PM mapping is achieved by exploiting the live migration capability of VMs. In this paper, we propose an approach which is based on two paradigms: first, that inputs to the DSC algorithm should be actual revenues and costs, and no artificial cost should be provided. Second, that a consolidation decision must take long term benefit into account. We first propose a profit model in which costs of actions such as VM migration decisions manifest as SLA penalties and increased operating costs. We then use this profit model to formulate this problem as a Markov Decision Process (MDP). We do this by narrowing our scope to workloads that are cyclic in nature. Although the proposed approach is not practical yet due to prohibitive complexity of the solution, preliminary results show that there is good potential in the approach and further research in this direction will be worthwhile.
code view source code

Unified Call Distributor, UCD (Novanet, Ltd)
Unified Call Distribution System (UCD) is a distributed system which routes requests to specific group of agents based on their skill set in a proportionate manner. It is an extension of OpenACD but is independent of the channel associated to any request and therefore, called unified. The system maintains sessions of agents, connects clients to the agents for communication, shows the current status on monitor etc. The client can communicate to agent on any kind of channel such as chat, voice call, email or through any social media network. The task of routing is carried by an integral component called Call Distributor (CD). It is developed in Erlang. The other component units Channel Servers (CS) and Agent Servers (AS) maintains sessions of agents or clients and feed data to CD. It, in turn, maintains real time data in mnesia database and performs the task of routing. The real time data includes all the agents logged in, queues of requests for each channel etc. Every request is assigned to an agent after matching mandatory properties, counting desired properties and at last based on "most idle" agent definition. Agents can be configured to handle multiple requests at a time. Such a channel configuration is called parallel channel. We keep mnesia tables with hashed names where key is properties of agents and value is list of free agents in order to speed up the routing process.



Last updated in May 2017