
Databricks Workflows: How We Automated Our Employee Satisfaction Analytics

Iqra Sarwar
8-9 Min Read Time

We used to manually run reports for our HR team until last month. Every Monday morning, our team would waste 2 hours pulling employee satisfaction data, cleaning it up, and generating dashboards that were already outdated by the time anyone looked at them. HR would get impatient, and we would start the week already behind.

 

The Breaking Point

The final straw came when our CEO asked for ad-hoc reports during an all-hands meeting. There I was, frantically querying databases while everyone waited. I had to admit that our "real-time" employee satisfaction tracking was actually T+3 days at best. That embarrassing moment finally pushed us to properly implement Databricks Workflows.

 

What Exactly Is Databricks Workflows?

Databricks Workflows is an orchestration layer for automating and scheduling data pipelines. Strip away the marketing talk, and it's essentially:

 

  • A task scheduler that can run notebooks, SQL, and Python jobs
  • A dependency manager that ensures tasks run in the correct order
  • A resource optimizer that starts and stops compute as needed
  • A monitoring system that alerts you when things break

 

What I appreciate most is how it handles compute resources. Before, our clusters ran always-on, costing us a fortune. With Workflows, we use job clusters that spin up just for the task and terminate when it's done. Our finance team noticed the savings immediately.

 


 

Our Employee Satisfaction Data Pipeline

Our employee mood tracking system has three main data sources:

 

  1. A daily "mood pulse" where employees record how they're feeling based on different questions (1-5 scale)
  2. Quarterly satisfaction surveys with detailed feedback
  3. Ad-hoc surveys around specific initiatives or events

 

Previously, I manually combined these sources into reports. Now, our Databricks Workflow handles everything automatically.

 

Setting Up the Pipeline

Here's how I built our employee satisfaction workflow:

 

  1. Created a new workflow called "satisfaction_survey_main"
  2. Added the first task to ingest data from various sources.
    This task extracts raw employee mood pulse data by consolidating sources, including historical survey records and Workstream database entries. It captures key fields such as timestamps, employee IDs, departments, and mood scores.
    • The historical survey data was fetched once during initial ingestion.
    • Ongoing ingestion continues from the Workstream database.
    • Runs on our existing cluster (we didn't need a job cluster for this small task)
  3. Created a silver-level processing task that depends on the ingestion task
    • This task cleans and standardizes the mood data
    • Maps employee IDs to departments and teams
    • Handles inconsistencies (like when someone accidentally submits multiple mood responses)
    • Augments with metadata (like company events or holidays that might affect mood)
  4. Added a gold-level task that takes the silver data and creates analytics-ready datasets
    • Calculates aggregations like average mood by department, team, and date
    • Identifies trend changes and statistical anomalies
    • Joins with ad-hoc survey results when available
    • Produces the final tables that power our dashboards

 

The interesting part was chaining these tasks together. In the workflow definition, we use a "depends_on" field to specify that the silver task can only run after ingestion completes, and the gold task only runs after the silver task finishes.
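
For anyone who wants to see the shape of it, here's a minimal sketch of that three-task chain expressed as a Jobs API payload. The workspace URL, token, notebook paths, cluster IDs, node type, and schedule are placeholders, and we actually deploy ours as a bundle rather than calling the API by hand, but the depends_on structure is the same idea:

```python
import requests

HOST = "https://<workspace-url>"           # placeholder
TOKEN = "<personal-access-token>"          # placeholder

workflow = {
    "name": "satisfaction_survey_main",
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",  # example: every day at 06:00
        "timezone_id": "UTC",
    },
    "job_clusters": [
        {
            "job_cluster_key": "pipeline_cluster",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",      # example node type
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "ingest_mood_data",
            "notebook_task": {"notebook_path": "/Pipelines/ingest_mood_data"},
            "existing_cluster_id": "<shared-cluster-id>",  # small task, shared cluster
        },
        {
            "task_key": "silver_processing",
            "depends_on": [{"task_key": "ingest_mood_data"}],
            "notebook_task": {"notebook_path": "/Pipelines/silver_mood_data"},
            "job_cluster_key": "pipeline_cluster",
        },
        {
            "task_key": "gold_processing",
            "depends_on": [{"task_key": "silver_processing"}],
            "notebook_task": {"notebook_path": "/Pipelines/gold_mood_analytics"},
            "job_cluster_key": "pipeline_cluster",
        },
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=workflow,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

In practice the silver and gold tasks run on their own, beefier compute; more on that below.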

 


 

Managing Different Data Sources

One challenge was handling our different data frequencies:

 

  • Mood pulse data comes in daily
  • Quarterly surveys arrive, well, quarterly
  • Ad-hoc surveys appear unpredictably

Rather than creating separate workflows, we built intelligence into our tasks:

 

  1. The ingestion task always runs and gets the latest mood pulse data
  2. It checks for new survey data (both quarterly and ad-hoc)
  3. If found, it processes those too; if not, it just processes the mood data
  4. The silver and gold tasks adapt their processing based on what data came in.

This approach means our workflow runs the same way every day, but handles different incoming data appropriately.
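
Here's a stripped-down sketch of what that "intelligent" ingestion step can look like inside the notebook. The table names and landing-zone path are illustrative, not our exact code, and it assumes it runs in a Databricks notebook where spark and dbutils are pre-defined:

```python
from pyspark.sql import functions as F

# 1. Always pull yesterday's mood pulse records from the Workstream database
#    (table names here are illustrative).
mood_df = (
    spark.read.table("workstream.mood_pulse")
    .where(F.col("submitted_at") >= F.date_sub(F.current_date(), 1))
)
mood_df.write.mode("append").saveAsTable("bronze.mood_pulse")

# 2. Check whether any new survey files (quarterly or ad-hoc) landed since the last run.
survey_path = "/mnt/surveys/incoming/"          # hypothetical landing zone
new_files = [f for f in dbutils.fs.ls(survey_path) if f.name.endswith(".csv")]

if new_files:
    surveys_df = spark.read.option("header", True).csv(survey_path)
    surveys_df.write.mode("append").saveAsTable("bronze.survey_responses")

# 3. Let the downstream tasks know what arrived in this run.
dbutils.jobs.taskValues.set(key="has_new_surveys", value=bool(new_files))
```

One way to make the silver and gold tasks adapt is to read that flag back with dbutils.jobs.taskValues.get and skip the survey-specific joins when it's False.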

 

Parameters for Flexibility

After running the workflow for a few weeks, our HR team started asking for specific reports. Instead of creating multiple workflows, I added parameters:

 

  • survey_type - selects pulse, annual, or exit surveys
  • date_range - lets us run historical analyses for specific time periods
  • department_filter - allows reports focused on specific parts of the company
  • sentiment_analysis - enables NLP processing on free-text comments
  • anonymity_threshold - ensures enough responses to maintain employee privacy

 

Now, when HR needs something specific, they just click "Run Now" and change the parameters. No code changes needed, no manual data pulls. In my notebooks, I access the parameters like this: dbutils.widgets.get("date_range"). Pretty straightforward.
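
For completeness, here's roughly what that looks like at the top of a notebook. The default values are illustrative; the point is that the workflow passes these parameters in and the notebook branches on them:

```python
# Declare the widgets so the notebook also runs standalone with sensible defaults
# (default values here are illustrative).
dbutils.widgets.text("survey_type", "pulse")
dbutils.widgets.text("date_range", "last_30_days")
dbutils.widgets.text("department_filter", "all")
dbutils.widgets.text("sentiment_analysis", "false")
dbutils.widgets.text("anonymity_threshold", "5")

survey_type = dbutils.widgets.get("survey_type")
date_range = dbutils.widgets.get("date_range")
department = dbutils.widgets.get("department_filter")
run_sentiment = dbutils.widgets.get("sentiment_analysis").lower() == "true"
min_responses = int(dbutils.widgets.get("anonymity_threshold"))

# Downstream logic branches on these values, e.g. only running the NLP step
# on free-text comments when run_sentiment is True.
```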

 


 

Handling Different Processing Needs

One technical challenge was that different parts of our pipeline needed different resources:

 

  1. Data ingestion is lightweight and runs fine on our shared cluster
  2. The silver task needs more memory for complex joins and data cleaning
  3. Our gold task needs heavy compute for statistical anomaly detection

 

Databricks Workflows solved this elegantly. We specified a different cluster for each task in our workflow configuration. The silver and gold tasks run as "run_job_task" types that reference separate job definitions with their own cluster configurations, for example, high-memory or high-compute.
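
Sketched out, the parent workflow looks something like the fragment below. The job IDs, cluster ID, and node type are placeholders; the point is that each run_job_task points at a job that carries its own cluster spec:

```python
# Tasks in the parent workflow (IDs are placeholders).
parent_tasks = [
    {
        "task_key": "ingest_mood_data",
        "notebook_task": {"notebook_path": "/Pipelines/ingest_mood_data"},
        "existing_cluster_id": "<shared-cluster-id>",   # lightweight, shared compute
    },
    {
        "task_key": "silver_processing",
        "depends_on": [{"task_key": "ingest_mood_data"}],
        "run_job_task": {"job_id": 111},   # job defined with a memory-optimized cluster
    },
    {
        "task_key": "gold_processing",
        "depends_on": [{"task_key": "silver_processing"}],
        "run_job_task": {"job_id": 222},   # job defined with a compute-optimized cluster
    },
]

# The silver job's own cluster spec can then lean on memory-optimized nodes, for example:
silver_job_cluster = {
    "job_cluster_key": "high_memory",
    "new_cluster": {
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "r5.2xlarge",   # memory-optimized instance type (AWS example)
        "num_workers": 4,
    },
}
```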

 

We also optimized cost by choosing the right compute for the right job. For example, running lightweight tasks on serverless compute trims total run time because there is no cluster start-up to wait for. Picking the right compute for each task also cut our overall compute costs significantly.

 

Real-Time Alerting for HR

Beyond scheduled reports, we implemented alerting as part of our data observability strategy to detect concerning patterns in employee sentiment, such as sudden drops in mood scores:

 

  1. Created a monitoring task that runs after the gold task
  2. It looks for:
    • Teams with declining satisfaction trends over two weeks
    • Departments with scores 20% below the company average
    • Sudden drops in individual employee mood scores
  3. When triggers are detected, the workflow:
    • Creates an alert in our HR dashboard
    • Sends notifications to relevant managers
    • Generates a preliminary analysis suggesting potential factors

 

The real value was the speed: HR now gets alerts within hours of a potential issue instead of waiting for weekly reports.
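
To make the checks above concrete, here's a minimal sketch of the kind of logic the monitoring task can run. The table and column names are illustrative; the 20%-below-average rule is the one described above, and the notification mechanics are omitted:

```python
from pyspark.sql import functions as F

# Gold-level table with one row per department per day (names are illustrative).
gold = spark.read.table("gold.daily_mood_by_department")
recent = gold.where(F.col("mood_date") >= F.date_sub(F.current_date(), 14))

# Company-wide average mood over the last two weeks.
company_avg = recent.agg(F.avg("avg_mood")).first()[0]

# Departments running 20% or more below the company average.
low_departments = (
    recent.groupBy("department")
    .agg(F.avg("avg_mood").alias("dept_avg"))
    .where(F.col("dept_avg") < 0.8 * company_avg)
)

if low_departments.count() > 0:
    # Feed the HR dashboard and manager notifications from this table.
    low_departments.withColumn("flagged_at", F.current_timestamp()) \
        .write.mode("append").saveAsTable("alerts.department_mood_flags")
```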

 


 

Technical Setup Details

For those interested in the actual technical implementation, our workflow configuration uses:

 

  • Parameter passing between tasks to maintain context
  • Bundle deployment for version control of our workflow
  • Health rules that alert us if the job runs longer than 30 minutes
  • Queue management to handle backlogged executions properly

 

The configuration specifies notification settings, health metrics, cluster IDs, notebook paths, job dependencies, and parameters all in a structured format that can be version controlled.
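
As a flavour of what that looks like, here is an illustrative fragment of those job settings. In a bundle this lives in YAML; the dictionary below just mirrors the same structure, and the e-mail address and thresholds are placeholders:

```python
# Fragment of the job settings (values are placeholders).
job_settings_fragment = {
    "email_notifications": {
        "on_failure": ["data-team@example.com"],
        # fires when a health rule below is breached
        "on_duration_warning_threshold_exceeded": ["data-team@example.com"],
    },
    "health": {
        "rules": [
            # alert if the run takes longer than 30 minutes
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 1800}
        ]
    },
    "queue": {"enabled": True},     # queue backlogged runs instead of skipping them
    "max_concurrent_runs": 1,
}
```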

 

Monitoring Our Workflow

The basic Databricks monitoring worked well enough, but we wanted more specific insights. We added:

 

  1. Custom metrics tracking:
    • Number of employees reporting daily (participation rate)
    • Processing time for each task
    • Data quality scores (completeness, consistency)
  2. A monitoring notebook that captures operational metrics (sketched after this list):
    • It pulls run data from the Databricks Jobs API
    • Records duration, status, and row counts for each run
    • Identifies trends in processing time or failure rates
  3. A separate dashboard specifically for data pipeline health
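
Here's a rough sketch of that monitoring notebook's core loop, using the standard Jobs runs-list endpoint. The workspace URL, secret scope, job ID, and output table name are placeholders:

```python
import requests

HOST = "https://<workspace-url>"                          # placeholder
TOKEN = dbutils.secrets.get("ops", "jobs_api_token")      # hypothetical secret scope/key
JOB_ID = 123456789                                        # placeholder job ID

resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"job_id": JOB_ID, "limit": 25},
)
resp.raise_for_status()

rows = []
for run in resp.json().get("runs", []):
    rows.append({
        "run_id": run["run_id"],
        "result_state": run.get("state", {}).get("result_state"),
        "duration_sec": (run.get("end_time", 0) - run.get("start_time", 0)) / 1000,
    })

# Append to the table that feeds the pipeline-health dashboard (name is illustrative).
if rows:
    spark.createDataFrame(rows).write.mode("append").saveAsTable("ops.workflow_run_history")
```

Duration and failure-rate trends then fall out of simple aggregations over that history table.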

 

This has been crucial for capacity planning. For example, we noticed our Monday processing took longer because weekend data arrived in batches. We adjusted our cluster sizing accordingly.

 

The Real Results

After 4 months with our automated workflow:

 

  1. HR gets daily updates instead of weekly ones
  2. Our compute costs dropped by 42% (job clusters vs. always-on)
  3. We can detect satisfaction issues 75% faster than before
  4. Department managers are more engaged with the data (because it's current)

 

The difference is hard to overstate. When our CEO now asks for employee satisfaction stats during a meeting, we just point to the real-time dashboard instead of frantically querying data.

 

Pain Points & Lessons Learned

To be honest, it wasn't all smooth sailing. Here are some hard-won lessons:

 

  1. Start with simple flows: My first version tried to process everything in one massive job. Debugging was a nightmare. Break complex workflows into multiple linked workflows.
  2. Watch your cluster configs: Our initial runs would sometimes fail because:
    • Memory settings were too low for our data volume
    • Autoscaling wasn't configured properly
    • Driver node would run out of memory while the workers sat idle
  3. Be careful with Delta tables: We initially had locking issues when multiple workflows tried to write to the same Delta tables, so we switched to a more granular approach with specific write paths.
  4. Set sensible timeouts: Our first health rule triggered alerts if jobs ran longer than 10 minutes. This caused false alarms every Monday when processing was heavier.
  5. Document everything: I created a wiki page explaining the workflow, parameters, and expected outputs. This saved me from endless questions when I went on vacation.

 

What Still Bothers Us

Workflows aren't perfect. Our ongoing frustrations:

 

  • Limited parameter validation (it's easy for users to enter invalid values)
  • No built-in data quality monitoring (we had to build our own)
  • The UI locks get annoying when you need to make quick changes
  • Deployment between dev/test/prod environments is still clunky
  • You can't easily clone a workflow with all its dependencies

 

Despite these issues, it's still better than our old manual process or trying to maintain Airflow ourselves.

 

Conclusion

If you're ready to escape spreadsheet hell and manual data pulls:

 

  1. Map out your current reporting process
  2. Identify logical breakpoints between tasks
  3. Determine dependencies between those tasks
  4. Start with a simple workflow that handles your most frequent data needs
  5. Add parameters to make it flexible
  6. Gradually expand to handle more complex cases

 

Don't try to automate everything on day one. Get one piece working well, then expand. We are still learning and improving, and would love to hear what's working for others.
