We used to manually run reports for our HR team until last month. Every Monday morning, our team would waste 2 hours pulling employee satisfaction data, cleaning it up, and generating dashboards that were already outdated by the time anyone looked at them. HR would get impatient, and we would start the week already behind.
The Breaking Point
The final straw came when our CEO asked for ad-hoc reports during an all-hands meeting. There I was, frantically querying databases while everyone waited. I had to admit that our "real-time" employee satisfaction tracking was actually T+3 days at best. That embarrassing moment finally pushed us to properly implement Databricks Workflows.
What Exactly Is Databricks Workflows?
Databricks Workflows is actually an orchestration layer that lets you automate and schedule data pipelines. Strip away the marketing talk, and it's essentially:
A task scheduler that can run notebooks, SQL, and Python jobs
A dependency manager that ensures tasks run in the correct order
A resource optimizer that starts and stops compute as needed
A monitoring system that alerts you when things break
What I appreciate most is how it handles compute resources. Before, our clusters ran around the clock, costing us a fortune. With Workflows, we use job clusters that spin up just for the task and terminate when it's done. Our finance team noticed the savings immediately.
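To make that concrete, here's roughly what a minimal version of such a job looks like when created through the Jobs API 2.1: one scheduled notebook task on an ephemeral job cluster that spins up for the run and terminates afterwards. The name, notebook path, runtime version, and node type below are illustrative placeholders, not our exact configuration.

```python
# A minimal sketch of a scheduled job with an ephemeral job cluster,
# created via the Databricks Jobs REST API (2.1). Values are illustrative.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "satisfaction_survey_main",
    # The scheduler: run every morning at 06:00.
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "ingest_mood_pulse",
            "notebook_task": {"notebook_path": "/Repos/data/ingest_mood_pulse"},
            # The resource optimizer: a job cluster that exists only for this run.
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",  # assumption: pick a current LTS runtime
                "node_type_id": "i3.xlarge",          # assumption: adjust for your cloud
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```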
Our Employee Satisfaction Data Pipeline
Our employee mood tracking system has three main data sources:
A daily "mood pulse" where employees record how they're feeling based on different questions (1-5 scale)
Quarterly satisfaction surveys with detailed feedback
Ad-hoc surveys around specific initiatives or events
Previously, I manually combined these sources into reports. Now, our Databricks Workflow handles everything automatically.
Setting Up the Pipeline
Here's how I built our employee satisfaction workflow:
Created a new workflow called "satisfaction_survey_main"
Added the first task to ingest data from various sources. This task extracts raw employee mood pulse data by consolidating sources, including historical survey records and Workstream database entries. It captures key fields such as timestamps, employee IDs, departments, and mood scores.
The historical survey data was fetched once during initial ingestion.
Ongoing ingestion continues from the Workstream database.
Runs on our existing cluster (we didn't need a job cluster for this small task)
Created a silver-level processing task that depends on the ingestion task
This task cleans and standardizes the mood data
Maps employee IDs to departments and teams
Handles inconsistencies (like when someone accidentally submits multiple mood responses)
Augments with metadata (like company events or holidays that might affect mood); a rough sketch of this cleaning step appears right after this list
Added a gold-level task that takes the silver data and creates analytics-ready datasets
Calculates aggregations like average mood by department, team, and date
Identifies trend changes and statistical anomalies
Joins with ad-hoc survey results when available
Produces the final tables that power our dashboards
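Here's the rough shape of that silver cleaning step in PySpark, as promised above. Table and column names (bronze_mood_pulse, dim_employees, and so on) are simplified stand-ins for our actual schema, and the code assumes it runs in a Databricks notebook where spark is already defined.

```python
# Sketch of the silver-level cleaning task. Table/column names are illustrative.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

raw = spark.read.table("bronze_mood_pulse")
employees = spark.read.table("dim_employees")   # employee_id -> department, team

# Keep only the latest response per employee per day (handles duplicate submissions).
latest_first = Window.partitionBy("employee_id", F.to_date("submitted_at")) \
                     .orderBy(F.col("submitted_at").desc())
deduped = (
    raw.withColumn("rn", F.row_number().over(latest_first))
       .filter("rn = 1")
       .drop("rn")
)

silver = (
    deduped
    .filter(F.col("mood_score").between(1, 5))    # drop invalid scores
    .join(employees, "employee_id", "left")       # map IDs to department and team
    .withColumn("pulse_date", F.to_date("submitted_at"))
)

silver.write.format("delta").mode("overwrite").saveAsTable("silver_mood_pulse")
```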
The interesting part was connecting these tasks together. In the workflow definition, we use a "depends_on" field to specify that the silver task can only run after ingestion completes, and the gold task only runs after the silver task finishes.
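In the job spec, that chaining boils down to a few lines. Task keys and notebook paths here are illustrative:

```python
# How the three tasks are chained with depends_on in the job spec
# (same Jobs API 2.1 format as above).
tasks = [
    {"task_key": "ingest",
     "notebook_task": {"notebook_path": "/Repos/data/ingest_mood_pulse"}},
    {"task_key": "silver",
     "depends_on": [{"task_key": "ingest"}],   # runs only after ingestion succeeds
     "notebook_task": {"notebook_path": "/Repos/data/silver_mood_pulse"}},
    {"task_key": "gold",
     "depends_on": [{"task_key": "silver"}],   # runs only after the silver task finishes
     "notebook_task": {"notebook_path": "/Repos/data/gold_mood_metrics"}},
]
```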
Managing Different Data Sources
One challenge was handling our different data frequencies:
Mood pulse data comes in daily
Quarterly surveys arrive, well, quarterly
Ad-hoc surveys appear unpredictably
Rather than creating separate workflows, we built intelligence into our tasks:
The ingestion task always runs and gets the latest mood pulse data
It checks for new survey data (both quarterly and ad-hoc)
If found, it processes those too; if not, it just processes the mood data
The silver and gold tasks adapt their processing based on what data came in.
This approach means our workflow runs the same way every day, but handles different incoming data appropriately.
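The check itself is nothing fancy. Here's a simplified sketch of the survey-detection logic in the ingestion notebook; the landing path and table names are placeholders for our real sources, and it assumes survey files and the bronze table record paths in the same format:

```python
# Sketch of the "check for new surveys" logic inside the ingestion notebook.
from pyspark.sql import functions as F

# Mood pulse data: always ingested.
pulse = spark.read.table("workstream_db.mood_pulse_responses")
pulse.write.format("delta").mode("append").saveAsTable("bronze_mood_pulse")

# Surveys (quarterly and ad-hoc): ingest only files we haven't seen yet.
landing_path = "/mnt/surveys/landing/"
already_loaded = {r.source_file for r in
                  spark.read.table("bronze_surveys")
                       .select("source_file").distinct().collect()}

new_files = [f.path for f in dbutils.fs.ls(landing_path)
             if f.path not in already_loaded]

if new_files:
    surveys = (spark.read.json(new_files)
                    .withColumn("source_file", F.input_file_name()))
    surveys.write.format("delta").mode("append").saveAsTable("bronze_surveys")
else:
    print("No new survey files today; only mood pulse data ingested.")
```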
Parameters for Flexibility
After running the workflow for a few weeks, our HR team started asking for specific reports. Instead of creating multiple workflows, I added parameters:
survey_type - toggles between pulse, annual, or exit surveys
date_range - lets us run historical analyses for specific time periods
department_filter - allows reports focused on specific parts of the company
sentiment_analysis - enables NLP processing on free-text comments
anonymity_threshold - ensures enough responses to maintain employee privacy
Now, when HR needs something specific, they just set the parameters and click "Run Now." No code changes needed, no manual data pulls. In my notebooks, I access the parameters like this: dbutils.widgets.get("date_range"). Pretty straightforward.
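For reference, the top of the notebook declares each widget with a sensible default so the scheduled run works untouched; the default values shown here are just examples:

```python
# Declare widgets with defaults so the scheduled run needs no parameters,
# then read whatever HR overrides via "Run Now".
dbutils.widgets.text("survey_type", "pulse")
dbutils.widgets.text("date_range", "last_30_days")
dbutils.widgets.text("department_filter", "all")
dbutils.widgets.text("anonymity_threshold", "5")

survey_type = dbutils.widgets.get("survey_type")
date_range = dbutils.widgets.get("date_range")
department = dbutils.widgets.get("department_filter")
min_responses = int(dbutils.widgets.get("anonymity_threshold"))
# The anonymity threshold is applied later, when aggregates are published.
```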
Handling Different Processing Needs
One technical challenge was that different parts of our pipeline needed different resources:
Data ingestion is lightweight and runs fine on our shared cluster
The silver task needs more memory for complex joins and data cleaning
Our gold task needs heavy compute for statistical anomaly detection
Databricks Workflows solved this elegantly. We specified different clusters for each task in our workflow configuration. The silver and gold tasks run as "run_job_task" types that reference separate job definitions with their own cluster configurations, for example high-memory or high-compute clusters.
We also optimized cost by choosing the right compute for the right job. For example, serverless compute for lightweight tasks shortens total job time by skipping cluster start-up entirely. This also contributed to the significant drop in our overall compute costs.
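Conceptually, the cluster-per-workload setup looks like the snippet below if you express it inside a single job spec. In practice we split silver and gold into separate jobs invoked via run_job_task, but the idea is the same; node types, sizes, and the cluster ID are illustrative:

```python
# Matching compute to each task within one Jobs API 2.1 spec (illustrative values).
job_clusters = [
    {"job_cluster_key": "high_memory",
     "new_cluster": {"spark_version": "14.3.x-scala2.12",
                     "node_type_id": "r5.2xlarge",   # memory-optimized for joins/cleaning
                     "num_workers": 4}},
    {"job_cluster_key": "high_compute",
     "new_cluster": {"spark_version": "14.3.x-scala2.12",
                     "node_type_id": "c5.4xlarge",   # compute-optimized for anomaly detection
                     "num_workers": 8}},
]

tasks = [
    # Lightweight ingestion reuses the always-on shared cluster.
    {"task_key": "ingest",
     "existing_cluster_id": "0101-123456-abcd123",   # placeholder cluster ID
     "notebook_task": {"notebook_path": "/Repos/data/ingest_mood_pulse"}},
    {"task_key": "silver",
     "depends_on": [{"task_key": "ingest"}],
     "job_cluster_key": "high_memory",
     "notebook_task": {"notebook_path": "/Repos/data/silver_mood_pulse"}},
    {"task_key": "gold",
     "depends_on": [{"task_key": "silver"}],
     "job_cluster_key": "high_compute",
     "notebook_task": {"notebook_path": "/Repos/data/gold_mood_metrics"}},
]
```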
Real-Time Alerting for HR
Beyond scheduled reports, we implemented alerting as part of our data observability strategy to detect concerning patterns in employee sentiment, such as sudden drops in mood scores:
Created a monitoring task that runs after the gold task
It looks for:
Teams with declining satisfaction trends over two weeks
Departments with scores 20% below the company average
Sudden drops in individual employee mood scores
When triggers are detected, the workflow:
Creates an alert in our HR dashboard
Sends notifications to relevant managers
Generates a preliminary analysis suggesting potential factors
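A simplified version of the first two checks looks something like this; the thresholds mirror the rules above, while the table names and the alerting hook are placeholders:

```python
# Sketch of the monitoring task's checks (runs after the gold task).
from pyspark.sql import functions as F

# Gold table produced upstream: one row per department per day.
gold = spark.read.table("gold_mood_by_department")   # department, pulse_date, avg_mood

company_avg = gold.agg(F.avg("avg_mood")).first()[0]
latest_date = gold.agg(F.max("pulse_date")).first()[0]

# Rule 1: departments currently more than 20% below the company average.
low_departments = (gold.filter(F.col("pulse_date") == latest_date)
                       .filter(F.col("avg_mood") < 0.8 * company_avg)
                       .collect())

# Rule 2: two-week declining trend (this week's average vs. the week before).
this_week = (gold.filter(F.col("pulse_date") >= F.date_sub(F.current_date(), 7))
                 .groupBy("department").agg(F.avg("avg_mood").alias("this_week")))
prior_week = (gold.filter((F.col("pulse_date") >= F.date_sub(F.current_date(), 14)) &
                          (F.col("pulse_date") < F.date_sub(F.current_date(), 7)))
                  .groupBy("department").agg(F.avg("avg_mood").alias("prior_week")))
declining = (this_week.join(prior_week, "department")
                      .filter(F.col("this_week") < F.col("prior_week"))
                      .collect())

for row in low_departments:
    # Placeholder: in our setup this writes to the alerts table behind the HR
    # dashboard and notifies the relevant manager.
    print(f"ALERT: {row.department} is below 80% of the company-wide average mood")
```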
The real value was the speed: HR now gets alerts within hours of a potential issue instead of waiting for weekly reports.
Technical Setup Details
For those interested in the actual technical implementation, our workflow configuration uses:
Parameter passing between tasks to maintain context
Bundle deployment for version control of our workflow
Health rules that alert us if the job runs longer than 30 minutes
Queue management to handle backlogged executions properly
The configuration specifies notification settings, health metrics, cluster IDs, notebook paths, job dependencies, and parameters all in a structured format that can be version controlled.
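For illustration, here are the job-level settings we keep under version control, expressed with Jobs API 2.1 field names (the e-mail address is a placeholder):

```python
# Job-level settings: notifications, the 30-minute health rule, and queueing.
job_settings = {
    "email_notifications": {
        "on_failure": ["data-team@example.com"],
    },
    # Health rule: alert if a run exceeds 30 minutes.
    "health": {
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 1800}
        ]
    },
    # Queue new runs instead of skipping them when a run is already in progress.
    "queue": {"enabled": True},
    "max_concurrent_runs": 1,
}
```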
Monitoring Our Workflow
The basic Databricks monitoring worked well enough, but we wanted more specific insights. We added:
Custom metrics tracking:
Number of employees reporting daily (participation rate)
Processing time for each task
Data quality scores (completeness, consistency)
A monitoring notebook that captures operational metrics (sketched after this list):
It pulls run data from the Databricks Jobs API
Records duration, status, and row counts for each run
Identifies trends in processing time or failure rates
A separate dashboard specifically for data pipeline health
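Here's the core of that monitoring notebook, trimmed down. The job ID and the target table are placeholders:

```python
# Pull recent runs for the workflow from the Jobs API and record duration/status.
import os
import requests
from datetime import datetime

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
JOB_ID = 123456789   # placeholder: the satisfaction_survey_main job

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": JOB_ID, "limit": 25},
)
resp.raise_for_status()

records = []
for run in resp.json().get("runs", []):
    records.append({
        "run_id": run["run_id"],
        "status": run.get("state", {}).get("result_state", "RUNNING"),
        "started_at": datetime.fromtimestamp(run["start_time"] / 1000),
        "duration_min": run.get("execution_duration", 0) / 60000,
    })

# Append to the pipeline-health table that backs our ops dashboard.
spark.createDataFrame(records).write.format("delta") \
     .mode("append").saveAsTable("ops.workflow_run_metrics")
```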
This has been crucial for capacity planning. For example, we noticed our Monday processing took longer because weekend data arrived in batches. We adjusted our cluster sizing accordingly.
The Real Results
After 4 months with our automated workflow:
HR gets daily updates instead of weekly ones
Our compute costs dropped by 42% (job clusters vs. always-on)
We can detect satisfaction issues 75% faster than before
Department managers are more engaged with the data (because it's current)
Most satisfying of all, when our CEO now asks for employee satisfaction stats during a meeting, we just point to the real-time dashboard instead of frantically querying data.
Pain Points & Lessons Learned
To be honest, it wasn't all smooth sailing. Here are some hard-won lessons:
Start with simple flows: My first version tried to process everything in one massive job. Debugging was a nightmare. Break complex workflows into multiple linked workflows.
Watch your cluster configs: Our initial runs would sometimes fail because:
Memory settings were too low for our data volume
Autoscaling wasn't configured properly
The driver node would run out of memory while the workers sat idle
Be careful with Delta tables: We initially hit locking issues when multiple workflows tried to write to the same Delta tables. We switched to a more granular approach, with specific write paths for each workflow.
Set sensible timeouts: Our first health rule triggered alerts if jobs ran longer than 10 minutes, which caused false alarms every Monday when processing was heavier. We eventually settled on the 30-minute threshold mentioned above.
Document everything: I created a wiki page explaining the workflow, parameters, and expected outputs. This saved me from endless questions when I went on vacation.
There are also a few limitations we're still working around:
Limited parameter validation (it's easy for users to enter invalid values)
No built-in data quality monitoring (we had to build our own)
The UI locks get annoying when you need to make quick changes
Deployment between dev/test/prod environments is still clunky
You can't easily clone a workflow with all its dependencies
Despite these issues, it's still better than our old manual process or trying to maintain Airflow ourselves.
Conclusion
If you're ready to escape spreadsheet hell and manual data pulls:
Map out your current reporting process
Identify logical breakpoints between tasks
Determine dependencies between those tasks
Start with a simple workflow that handles your most frequent data needs
Add parameters to make it flexible
Gradually expand to handle more complex cases
Don't try to automate everything on day one. Get one piece working well, then expand. We are still learning and improving, and would love to hear what's working for others.