At Arista Networks, I have been managing the Zero Touch Provisioning (ZTP) package.
ZTP is one of the components of CloudVision Portal (CVP), Arista's solution for automating
network operations and monitoring networks for enterprises and universities.
ZTP allows devices to be onboarded onto CVP quickly and smoothly, without requiring any
manual intervention.
ZTP has two components: a device/router-side component written in C/C++ and a CVP-side
component written in Java/Go. I have fixed various ZTP bugs that had existed since its
inception, both in the product and in its tests (ctest, unit/integration tests).
I have also led work on new features such as support for certificates without a SAN IP,
hardware authentication using TPM 2.0 devices, and web proxy support in ZTP.
Recently, Arista has begun its cloud offering of CloudVision Portal, CVP as a Service
(CVaaS). For ZTP in the context of CVaaS, the goal has been to provide a smooth and
secure onboarding experience while making sure that the right device shows up for the
right tenant. This required adding support for a device-embedded token for initial
authentication, making hardware authentication using TPM work in the cloud context, setting
up a pipeline from the sales database to map devices to the right tenant, talking to a
redirector to find the right cluster for a tenant, and enabling onboarding using a USB
drive or a custom bootstrap script.
I have also been a crucial part of implementing the Ownership Voucher gRPC service
for Secure ZTP (RFC 8572). Ownership Vouchers allow device onboarding even in less
secure contexts by establishing a root of trust using vendor-issued vouchers. This
required coordination among multiple teams, designing the APIs for the gRPC service,
working out authentication and authorization for users, and defining APIs for getting
vouchers signed securely using Arista's root of trust.
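As a rough illustration of the service surface (the names and message fields below are hypothetical placeholders, not Arista's actual API), the gRPC service essentially binds a device identity to a vendor-signed voucher:

// Hypothetical sketch only: the real service, message fields, and RPC names
// are internal to Arista and not reproduced here.
package voucher

import "context"

// IssueVoucherRequest identifies the device the voucher will be bound to.
type IssueVoucherRequest struct {
	SerialNumber string // device serial number
	DeviceCert   []byte // PEM-encoded device certificate (e.g. TPM-backed)
}

// IssueVoucherResponse carries the signed ownership voucher (RFC 8572),
// typically a CMS-signed artifact rooted in the vendor's trust anchor.
type IssueVoucherResponse struct {
	Voucher []byte
}

// OwnershipVoucherService is the shape of the gRPC service: callers must be
// authenticated, and authorization checks ensure a user can only request
// vouchers for devices owned by their tenant.
type OwnershipVoucherService interface {
	IssueVoucher(ctx context.Context, req *IssueVoucherRequest) (*IssueVoucherResponse, error)
}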
My focus at Dgraph was to improve performance, fix bugs, and add new features in the
database. I added support for Upsert and Multiple Mutations in the query language by
modifying the compiler and execution engine. I improved indexing performance multi-fold
and reduced memory usage by spilling temporary data to disk. I also laid down
initial work
showing that the Go ecosystem needed a high-performance cache and that the existing caches
were not enough, which eventually led to
Ristretto.
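For context, this is roughly how Ristretto is used today (based on its public v1 API; details may differ between versions):

package main

import (
	"fmt"

	"github.com/dgraph-io/ristretto"
)

func main() {
	// A cost-based, concurrent cache: admission and eviction are driven by
	// per-entry cost and access frequency rather than a simple LRU list.
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 1e7,     // number of keys to track frequency for
		MaxCost:     1 << 30, // total cost budget (~1 GB if cost == bytes)
		BufferItems: 64,      // keys per internal Get buffer
	})
	if err != nil {
		panic(err)
	}

	cache.Set("uid:0x1", []byte("some posting list"), 17) // cost in bytes
	cache.Wait()                                          // wait for the async Set to be applied

	if v, ok := cache.Get("uid:0x1"); ok {
		fmt.Printf("cached: %s\n", v)
	}
}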
I also enabled Dgraph to compute indexes in the background while still maintaining strong
consistency guarantees. This allows Dgraph to serve mutations and queries even while
indexing is in progress. I was able to logically prove that index entries are computed
correctly for mutations applied both before and after the indexing command is issued.
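A simplified sketch of the argument (illustrative Go, not Dgraph's actual code): the rebuild scans a snapshot taken at a start timestamp, while mutations committed at or after that timestamp also update the index inline; every mutation is covered by at least one path, and index writes are idempotent, so any overlap is harmless:

// Simplified illustration only; Dgraph's real implementation differs.
package sketch

type Index interface {
	Add(token string, uid uint64) // idempotent: adding the same entry twice is a no-op
}

type Mutation struct {
	UID      uint64
	Value    string
	CommitTs uint64
}

func tokenize(v string) string { return v } // placeholder tokenizer

// Inline path: applied to every mutation committed while a rebuild is running.
func indexInline(idx Index, m Mutation, rebuildStartTs uint64) {
	if m.CommitTs >= rebuildStartTs {
		idx.Add(tokenize(m.Value), m.UID)
	}
}

// Rebuild path: covers everything committed before rebuildStartTs.
func rebuildIndex(idx Index, snapshotAtStartTs []Mutation) {
	for _, m := range snapshotAtStartTs {
		idx.Add(tokenize(m.Value), m.UID)
	}
}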
All my contributions to Dgraph can be found
here.
Apart from that, I was the founding engineer of the team in Bangalore and helped grow the
team from just myself to 16 people by the time I left. I also added a couple of articles to the
pool of excellent
blogs that Dgraph has.
My focus at Zeotap was on building, scaling, optimizing, and stabilizing various data
pipelines and systems. Zeotap deals with billions of records every day and requires
processing terabytes of data quickly and efficiently. I built various components of our data
pipelines using Spark and Oozie while ensuring data consistency even when jobs failed.
We optimized one of the layers in our data pipeline and reduced its run time from 4
hours to only 20 minutes by correctly identifying the bottleneck. I also worked on setting
up and optimizing Redshift to enable creating segments and running ad hoc analysis queries
efficiently. I migrated one of our data stores, as well as all the data pipelines around it,
from Aurora to Aerospike, improving throughput from 10,000 QPS to more than 300,000 QPS.
Apart from this, I spent time building and implementing algorithms for attributing
revenue back to our data partners, and optimized the algorithm so that it could
finish within the desired SLA. I also built a system that can estimate the size of ad hoc
segments over billions of records with sub-second latency. This required building a
customized in-memory database specifically designed to run such queries over flat as well as
nested data. The system was implemented in Go for efficiency, while ensuring high uptime
using fail-fast and supervisor tree ideas from the Erlang programming language.
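A minimal sketch of how the fail-fast/supervisor idea translates to Go (names are illustrative, not the actual Zeotap code): workers panic on unrecoverable errors and a supervisor goroutine restarts them with a small backoff:

package main

import (
	"log"
	"time"
)

// supervise restarts worker whenever it panics, mimicking an Erlang-style
// one-for-one supervisor: fail fast inside the worker, recover at the edge.
func supervise(name string, worker func()) {
	go func() {
		for {
			func() {
				defer func() {
					if r := recover(); r != nil {
						log.Printf("%s crashed: %v; restarting", name, r)
					}
				}()
				worker()
			}()
			time.Sleep(time.Second) // simple backoff before restarting
		}
	}()
}

func main() {
	supervise("segment-estimator", func() {
		// A real worker would serve estimation queries; here it just fails
		// fast on a simulated unrecoverable error.
		panic("simulated unrecoverable error")
	})
	time.Sleep(5 * time.Second)
}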
I also helped build a collaborative culture in our Data Engineering team. I started a
peer code review process on GitHub to improve code quality and increase collaboration
within the team. To improve communication across teams, I pushed for internal
tech talks. I wrote the first engineering blog post at Zeotap to increase our exposure outside
the company and encouraged everyone in the team to write, so that we could better shape and
communicate our ideas.
At Lumiata, I worked on scaling the existing Spark cluster to potentially hundreds of nodes by
containerizing all the services using Docker and performing manual orchestration using
docker-compose. I also set up a secure Docker registry and a UI server for browsing
Docker images. Before the India office was closed down, I was looking into setting up a
Kubernetes cluster for running multi-node Spark clusters.
I implemented support for adding external storage to ZeroStack using OpenStack Cinder
drivers, in a way that is extensible to future Cinder drivers with minimal changes to
ZeroStack. The operation was atomic, followed an eventual consistency model, and could
recover from unexpected crashes or power failures. Towards the end of my internship, I also
helped automate setting up Kubernetes clusters for customers.
We developed a monitoring infrastructure for Linux containers running on a cluster of physical
servers using collectd, InfluxDB (a time series database), and Grafana (for visualizing
various metrics). We then compared various load prediction and container packing algorithms
used for server consolidation in a cloud environment, using the technique of Discrete Event
Simulation. We picked a specific set of containers used inside Bell Labs, which stream an
arbitrary number of pixel streams, and analyzed their workload. To host a large number of such
containers on a minimum set of physical servers while still maintaining an acceptable level of
user response time, we needed to compare algorithms for load prediction and container packing.
We therefore developed a Discrete Event Simulation infrastructure and showed that the
simulated workload is close to the real workload.
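A minimal sketch of the discrete event simulation loop (illustrative Go, not the original Bell Labs code): events sit in a time-ordered queue and the simulated clock jumps from one event to the next instead of ticking in real time:

package main

import (
	"container/heap"
	"fmt"
)

// Event is anything that happens at a simulated point in time, e.g. a
// container's load changing or a migration finishing.
type Event struct {
	Time   float64
	Action func(now float64)
}

type eventQueue []*Event

func (q eventQueue) Len() int            { return len(q) }
func (q eventQueue) Less(i, j int) bool  { return q[i].Time < q[j].Time }
func (q eventQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x interface{}) { *q = append(*q, x.(*Event)) }
func (q *eventQueue) Pop() interface{} {
	old := *q
	n := len(old)
	e := old[n-1]
	*q = old[:n-1]
	return e
}

func main() {
	q := &eventQueue{}
	heap.Init(q)
	heap.Push(q, &Event{Time: 1.0, Action: func(now float64) { fmt.Println(now, "load sample arrives") }})
	heap.Push(q, &Event{Time: 0.5, Action: func(now float64) { fmt.Println(now, "container starts") }})

	// The simulation clock advances directly to the next event's timestamp.
	for q.Len() > 0 {
		e := heap.Pop(q).(*Event)
		e.Action(e.Time)
	}
}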
view related paper
As part of a 3-day company-wide hackathon, we (a team of 4 engineers) replaced an
external service (Spotinst) used to scale AWS Elastic MapReduce (EMR) clusters with
spot nodes by a custom-built service, saving more than $50K/year. We used signals from
the YARN cluster scheduler, such as pending containers, running containers, and allocated
memory, to provide shorter job completion times (through faster scale-up), better
visibility, and proactive alerts.
Zscaler scales up an EMR cluster with spot nodes when jobs are queued on the YARN scheduler.
It chooses a set of nodes by maximizing a cost function (a non-linear program) that combines
node availability, node interruption probability, and the cost of running the cluster.
Once jobs are complete, it removes the nodes to ensure that the cluster doesn't
run idle. If, for some reason, Zscaler is not able to add nodes to the cluster, it
notifies the cluster admin for manual intervention.
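An illustrative sketch of the node-selection step (the names and weights below are made up; the real cost function and solver are internal): each candidate spot instance type is scored on availability, interruption probability, and price, and the best-scoring candidates are requested first:

package main

import (
	"fmt"
	"sort"
)

// Candidate is a spot instance type / availability-zone combination the
// scaler can request.
type Candidate struct {
	Name          string
	Availability  float64 // estimated chance a spot request is fulfilled, 0..1
	InterruptProb float64 // estimated chance of interruption, 0..1
	HourlyPrice   float64 // current spot price in $/hour
}

// score combines the three signals; the weights here are illustrative only.
func score(c Candidate) float64 {
	return 2.0*c.Availability - 3.0*c.InterruptProb - 0.5*c.HourlyPrice
}

// pick returns the n best-scoring candidates to request from EMR.
func pick(cands []Candidate, n int) []Candidate {
	sort.Slice(cands, func(i, j int) bool { return score(cands[i]) > score(cands[j]) })
	if n > len(cands) {
		n = len(cands)
	}
	return cands[:n]
}

func main() {
	cands := []Candidate{
		{"r5.2xlarge/us-west-2a", 0.9, 0.05, 0.25},
		{"r5.2xlarge/us-west-2b", 0.7, 0.20, 0.18},
	}
	fmt.Println(pick(cands, 1)[0].Name)
}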
view hackathon presentation
Data centers are becoming increasingly popular for their flexibility and processing capabilities.
They are managed by a single entity and used by thousands of users. Centralized administration
makes them extensible, predictable and more reliable, making them a good environment for
innovative distributed applications. The flexibility and stronger guarantees provided by
the data center network allow us to co-design distributed systems with their network layer
and optimize their overall performance. In this project, we have analyzed the performance of
the Raft consensus algorithm in the presence of heavy network load. We show that the degraded
system performance can be significantly improved by co-designing the system with the network
layer. We evaluate our idea on a 5-node cluster of the etcd key-value store, set up on Banana Pi
boards connected in a star topology. We also show how the network layer improvements can be
exposed to the application in a transparent manner in order to simplify application development.
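As one hypothetical example of what such transparent co-design can look like (not necessarily the mechanism used in this project), consensus traffic can be marked with a DSCP value on existing connections so that priority queues on the switches favour it, without the application changing how it uses the connection:

package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
)

// markConn sets the IP TOS/DSCP bits on an established TCP connection so that
// switches configured with priority queues can favour this traffic.
func markConn(c net.Conn, dscp int) error {
	// The TOS byte carries the DSCP value in its upper six bits.
	return ipv4.NewConn(c).SetTOS(dscp << 2)
}

func main() {
	// Hypothetical: mark a Raft peer connection as high priority (DSCP 46, EF).
	c, err := net.Dial("tcp", "10.0.0.2:2380")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	if err := markConn(c, 46); err != nil {
		log.Fatal(err)
	}
	// The application then uses c exactly as before; the prioritization is
	// transparent to it.
}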
view source code
view report
The recent emergence of Network Function Virtualization (NFV) technology has provided greater
flexibility for applications running in the cloud in terms of placement, scalability, etc. At
the same time, it has created a need for sophisticated mechanisms and orchestration tools
to achieve bounded latency and throughput, because of buffering in software queues, slower
processing compared to hardware network functions (NFs), software context switching, and
unpredictable software behaviour due to caching and other effects. The goal is to develop
tools that provide a given Quality of Service (QoS), defined in terms of end-to-end latency,
response time, or throughput, in the presence of software NFs, using placement, resource
allocation, and flow routing as control parameters. We have developed an infrastructure to
run network functions as Docker containers; networking is set up using OpenStack, nova-docker,
and Open vSwitch, and routes are installed using an SDN controller such as OpenDaylight.
view source code
Next generation sequencing (NGS) technologies have revolutionized the way genetic
information is analyzed. They have eliminated the limitations of previous techniques in terms of
throughput, speed, scalability, and resolution. The effectiveness of NGS technologies is
further improved by applying error correction to the imperfect though redundant
sequence reads. In this project, we used the HPC expertise gained in the CSE 6230
class to improve the runtime of a GPU-based solution (CUDA-EC) to this problem. We
propose a parallel implementation of the same algorithm that reduces kernel execution
time by approximately 14.81%, achieved by increasing warp efficiency from
3.5% to 67.5% along with a few other code optimizations. We also show that the
multiple-threads-one-read model has enough task parallelism to exploit on a GPU compared
to the classical one-thread-one-read approach.
view source code
view report
The Software Defined Networking (SDN) paradigm introduces the separation of the data plane
(forwarding plane) from the control plane, as well as centralization of the algorithms that
make routing decisions. Centralization greatly simplifies network management by simplifying
policy specification, modification, implementation, and verification. Although significant
research has been conducted on SDN in recent years, there has not been enough discussion on
how to deal with the age-old yet common problem of network failures. Centralization also
introduces more points of failure, such as controller failures, into the network. In this
paper, we consolidate existing work in the domain of fault tolerance in software defined
networks. We then define the problem of link failures and the constraints we faced while
solving it. We propose an algorithm for proactively handling link failures and have
implemented our solution using Pyretic (an SDN controller framework).
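A tiny illustrative sketch of the proactive idea (the actual project was implemented in Pyretic): for every link on a request's primary path, a backup path avoiding that link is computed ahead of time, so a failure can be handled without recomputing routes on the fly:

package main

import "fmt"

// Graph is an adjacency list over switch IDs.
type Graph map[string][]string

// bfsPath returns a shortest path from src to dst, optionally skipping one link.
func bfsPath(g Graph, src, dst, avoidA, avoidB string) []string {
	prev := map[string]string{src: ""}
	queue := []string{src}
	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		if u == dst {
			break
		}
		for _, v := range g[u] {
			if (u == avoidA && v == avoidB) || (u == avoidB && v == avoidA) {
				continue // pretend this link has failed
			}
			if _, seen := prev[v]; !seen {
				prev[v] = u
				queue = append(queue, v)
			}
		}
	}
	if _, ok := prev[dst]; !ok {
		return nil
	}
	var path []string
	for v := dst; v != ""; v = prev[v] {
		path = append([]string{v}, path...)
	}
	return path
}

func main() {
	g := Graph{"s1": {"s2", "s3"}, "s2": {"s1", "s4"}, "s3": {"s1", "s4"}, "s4": {"s2", "s3"}}
	primary := bfsPath(g, "s1", "s4", "", "")
	fmt.Println("primary:", primary)
	// Precompute a backup path for each link on the primary path.
	for i := 0; i+1 < len(primary); i++ {
		backup := bfsPath(g, "s1", "s4", primary[i], primary[i+1])
		fmt.Printf("if %s-%s fails, use %v\n", primary[i], primary[i+1], backup)
	}
}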
view source code
Smart grids are becoming ubiquitous with the proliferation of easy-to-install solar and wind
power generation. The goal of consuming energy close to where it is generated, instead of
transmitting it over large distances, calls for systems that can process millions of events
generated by smart plugs and power generation sources in near real time. The heart of these
systems is often a module that processes dense power consumption event streams and predicts
consumption patterns for specific occupational units such as a house or a building. It is also
often useful to identify outliers that consume significantly more power than similar devices
in the same occupational unit (such as a block or a neighborhood). In this paper, we present a
system that can process over a million events per second from smart plugs and correlate the
information to output accurate predictions as well as identify outliers. Our system is built
from the ground up in C++ rather than on a sophisticated complex event processing engine or
other commercial product, which is why we believe we achieved very high throughput with very
little CPU capacity spent on processing the events. Our results show that the throughput of our
system is over a million events per second while using under 20% of one core.
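A simplified sketch of the outlier part (illustrative Go; the real system is written in C++ and also does windowing and prediction): each plug's consumption in the current window is compared against the median of its group, and plugs far above the median are flagged:

package main

import (
	"fmt"
	"sort"
)

// flagOutliers returns plug IDs whose consumption exceeds the group median
// by more than factor. usage maps plug ID -> consumption in the current window.
func flagOutliers(usage map[string]float64, factor float64) []string {
	vals := make([]float64, 0, len(usage))
	for _, v := range usage {
		vals = append(vals, v)
	}
	sort.Float64s(vals)
	median := vals[len(vals)/2]

	var out []string
	for id, v := range usage {
		if v > factor*median {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	house := map[string]float64{"plug-1": 0.2, "plug-2": 0.25, "plug-3": 1.9}
	fmt.Println(flagOutliers(house, 3.0)) // [plug-3]
}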
view source code
view paper
Erlang is a general-purpose, highly concurrent functional programming language. It was
designed to support distributed, fault-tolerant, scalable, non-stop applications. It has been
used in production systems (e.g. the AXD301 ATM switch) with an uptime of 99.9999999%
(nine nines). It is used by Facebook, GitHub, Riak, Amazon, etc. to develop large distributed
systems. We leveraged the distributed capabilities of Erlang and developed yet another DFS,
namely the Erlang Distributed File System (eDFS). It is highly concurrent, reliable, scalable,
and fault tolerant. In this report, we first develop a small taxonomy of DFSs. We then describe
the architecture of eDFS and compare its design with existing DFSs.
view source code
view report
Dynamic Server Consolidation (DSC) is a technique used by cloud owners to consolidate
multiple virtual machines (VMs) that run client application components onto a minimal
number of physical machines (PMs). Dynamic changes in the VM-to-PM mapping are achieved
by exploiting the
live migration capability of VMs. In this
paper, we propose an approach based on two paradigms:
first, that inputs to the DSC algorithm should be actual revenues
and costs, and no artificial cost should be introduced; second,
that a consolidation decision must take
long term benefit
into account. We first propose a profit model in which the costs of
actions such as VM migration decisions manifest as SLA penalties and
increased operating costs. We then use this profit model to
formulate the problem as a
Markov Decision Process
(MDP), narrowing our scope to workloads that are
cyclic in nature. Although the proposed approach is not yet
practical due to the prohibitive complexity of the solution,
preliminary results show that there is good potential in the
approach and that further research in this direction will be worthwhile.
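Schematically, the long-term objective follows the standard MDP value recursion (a generic sketch, not the exact model from the paper), where a state s captures the current VM-to-PM mapping together with the phase of the cyclic workload, an action a is a set of migration decisions, and P(s, a) is the immediate profit, i.e. revenue minus SLA penalties and operating/migration costs:

\[
V(s) = \max_{a} \Big[ P(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, V(s') \Big]
\]

Here T(s' | s, a) is the transition probability between mappings and \gamma < 1 discounts future profit, which is what makes the consolidation decision account for long-term benefit.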
view source code
The Unified Call Distribution System (UCD) is a distributed system which routes requests
to specific groups of agents, based on their skill sets, in a proportionate manner. It is
an extension of
OpenACD
but is independent of the channel associated with a request and is therefore called unified.
The system maintains agent sessions, connects clients to agents for communication, and shows
the current status on a monitor. A client can communicate with an agent over any kind of
channel, such as chat, voice call, email, or any social media network.
The task of routing is carried out by an integral component called the Call Distributor (CD),
developed in Erlang. The other components, Channel Servers (CS) and Agent Servers (AS),
maintain sessions of agents and clients and feed data to the CD, which in turn maintains
real-time data in an mnesia database and performs the routing. The real-time data includes
all logged-in agents, the queues of requests for each channel, etc. Every request is assigned
to an agent by first matching mandatory properties, then counting desired properties, and
finally by using a "most idle" agent definition. Agents can be configured to handle multiple
requests at a time; such a channel configuration is called a parallel channel. To speed up
routing, we keep mnesia tables with hashed names, where the key is a set of agent properties
and the value is the list of free agents with those properties.
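A small sketch of the lookup idea in Go (the original is Erlang/mnesia; this only illustrates the data layout): free agents are bucketed under a key derived from their property set, so matching a request becomes a single keyed lookup instead of a scan over all agents:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// propsKey derives a stable key from an agent's (or request's) mandatory
// properties, analogous to the hashed mnesia table names.
func propsKey(props map[string]string) string {
	keys := make([]string, 0, len(props))
	for k, v := range props {
		keys = append(keys, k+"="+v)
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

// freeAgents maps a property key to the list of idle agents with exactly
// those properties; routing a request becomes a single map lookup.
var freeAgents = map[string][]string{}

func agentFree(id string, props map[string]string) {
	k := propsKey(props)
	freeAgents[k] = append(freeAgents[k], id)
}

func route(reqProps map[string]string) (string, bool) {
	k := propsKey(reqProps)
	agents := freeAgents[k]
	if len(agents) == 0 {
		return "", false
	}
	// Head of the list is the longest-idle agent ("most idle" definition).
	agent := agents[0]
	freeAgents[k] = agents[1:]
	return agent, true
}

func main() {
	agentFree("alice", map[string]string{"skill": "billing", "lang": "en"})
	fmt.Println(route(map[string]string{"skill": "billing", "lang": "en"})) // alice true
}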