These sessions were presented at OSA Con 2023 on December 12-14, 2023.

`New` Workflow Orchestrator in town: Apache Airflow 2.x

by Jarek Potiuk
Quite often you hear about the “new” orchestrator that aims to solve your orchestration needs. You can also often hear how it compares to Airflow. However, those comparisons often overlook the fact that since Airflow 2.0 was introduced, it has continued to evolve and modernize itself piece by piece. A new UI, new ways of writing your orchestration tasks, new ways to test them. And the …

A Guide to Responsible Data Collection In Open Source

by Avi Press
Collecting usage data in open source can be a controversial topic, but attitudes on this topic have been notably shifting recently. After working with many open source projects and companies over the past 4 years, our team at Scarf has established empirically successful best practices and considerations that all open source projects should be aware of to effectively track the usage of their …

An Overview of DuckDB

by Gabor Szarnyas
DuckDB is an analytical database management system. It runs in-process, which makes its configuration trivial and eliminates any overhead between the client application and the database. DuckDB is open-source and highly portable with integrations for Python, R, Java, Julia, and 10+ other languages. DuckDB has top-notch support for data formats (CSV, Parquet, JSON, Iceberg) and data sources (https, …

Apache Pulsar: Finally an Alternative to Kafka?

by Julien Jakubowski
Today, when you think about building event-driven and real-time applications, the words that come to you spontaneously are probably: RabbitMQ, ActiveMQ, or Kafka. These are the solutions that dominate this landscape. But have you ever heard of Apache Pulsar? After a brief presentation of the fundamental concepts of messaging, you’ll discover the Apache Pulsar features that enable you to …

Build a fully-managed OSS compatible lakehouse with BigLake Managed Tables

by Jeffrey Nelson
Open table formats like Apache Iceberg, Apache Hudi, and Delta Lake use embedded metadata, stored alongside data on object stores, to provide transactionally consistent DML and time travel features. This metadata is usually backed by a transaction log, also stored in object storage. While this approach of maintaining a transaction log on an object store provides simplicity to build an open …

Building a ChatGPT Data Pipeline with RisingWave Stream Processor and Astra Vector Search

by Mary Grygleski & Karin Wolok
Enter the exciting brave new world of GenAI by building a ChatGPT data pipeline that leverages RisingWave’s efficient stream processing for real-time market data, enriched with Astra/Cassandra’s high-performance vector embedding and similarity search.

CICD Pipelines for dbt: DIY or DIWhy?

by Cameron Cyr
Continuous integration and continuous deployment (CICD) pipelines are crucial for deploying efficient and reliable data transformations. In this session we will answer the question “DIY or DIWhy?”, where we will help you decide if you should build your own CICD pipelines for deploying your dbt project, or if you should use dbt Cloud’s out of the box solutions. We’ll examine …

Data Alchemy: Transforming Raw Data to Gold with Apache Hudi and DBT

by Nadine Farah
The medallion architecture graduates raw data sitting in operational systems into a set of refined tables in a series of stages, ultimately processing data to serve analytics from gold tables. While there is a deep desire to build this architecture incrementally, it is very challenging with current technologies available on lakehouses. Many technologies can’t efficiently update records or …

Data as Code: Project Nessie brings a Git-like experience for Apache Iceberg Tables

by Alex Merced
Multi-table transactions have existed in data warehouses for some time, but with the open source Project Nessie, multi-table transactions and an innovative git-like experience become available to data lakehouses. In this session, learn how Project Nessie enables the new “Data as Code” paradigm allowing for workload isolation, multi-table transactions and experimentation when working with Apache …

Data on GKE

by Akshay Ram
Kubernetes was mostly associated with stateless applications such as web and batch applications. However, like most things, Kubernetes is constantly evolving. These days, we are seeing an exponential increase in the number of stateful apps on Kubernetes. In fact, the number of clusters running stateful apps on Google Kubernetes Engine (GKE) has doubled every year since 2019. Learn how today, …

ETL with Meltano + Singer in the LLM era

by Pat Nadolny
Are we reinventing ETL in the LLM app ecosystem? While the existing LLM app tools like LangChain and LlamaIndex are useful for building LLM apps, their ETL features fall short for production use cases. We’ll explore Singer and the Meltano community, new data pipeline needs in the AI space, and how we can apply data engineering principles to solve them. Video of demo

From Click to Insight: Transforming Streams with Apache Flink

by Andrey Gusarov
In this talk, I’ll delve into using Apache Flink for real-time distributed data processing in diverse product initiatives. From implementing counters and windowed analytics to online data enrichment, I’ll highlight the challenges faced and share insights on harnessing Flink’s capabilities to address these scenarios in high-demand environments.

From Zero to Superset Hero: Data visualisation as code with Terraform

by Viktoria Ondrejova
In this session, we will share the journey of overcoming common frustrations with Superset. These day-to-day struggles include the time-consuming process of renaming a column in each chart, copy-pasting metrics just to create a slightly similar chart, and many more challenges that grow bigger as your company does. The challenges got so big for us that despite having zero prior experience with the …

Getting Started with Polars

by Matt Harrison
Get ready to revolutionize your data analysis with Polars - the newest, most highly optimized dataframe library on the market! In this talk, we’ll introduce you to the power of Polars and show you how it compares to the popular Pandas library.

Going beyond Observability: Grafana for Analytics

by Kyle Cunningham
Grafana is a powerful platform for infrastructure observability and visualization, allowing easy access to a wide array of operational metrics. It is more than that, however: Grafana has been quickly adding a multitude of features to support a wide variety of data analysis use cases, all while retaining the intuitive user experience Grafana has become known for. See how new …

How to implement Data Contracts with DataHub

by Shirshanka Das
Data contracts have been much discussed in the community of late, with a lot of curiosity around how to approach this concept in practice. We believe data contracts need a harmonizing layer to manage data quality in a uniform manner across a fragmented stack. We are calling this harmonizing layer the Control Plane for Data - powered by the common thread across these systems: metadata. This talk …

How we built a zero-ETL data infrastructure for real-time analytics

by Siddarth Jain
Kenobi is a real-time analytics platform that can ingest JSON and, without any manual intervention for data schema definition or data modelling, enables anyone to create metrics and aggregations on the data. This lets teams use logs and system-generated events directly without needing to pre-process the data to make it usable. In this talk, we’ll discuss the challenges in creating …

Leveraging object storage: Tiered Storage for ClickHouse

by Arthur Ansquer
Discussing the journey to make Tiered Storage available for Aiven for ClickHouse. From product discovery to the benefits and use cases of leveraging object storage for ClickHouse workloads.

Make data movement limitless and secure with Open Source

by Michel Tricot
In 2017, the average number of SaaS apps used by an organization was 16; by 2022, that number had grown to 110, and it doesn’t even account for databases and files. Accessing and integrating that data into a warehouse is a significant challenge for organizations. Each of these sources is a data silo with huge variability in data formats and complexity of access. In order to ensure …

Many Faces of Real-time Analytics

by Dunith Dhanushka
Real-time analytics systems derive meaningful insights from continuous streams of data, enabling organizations to make swift decisions and react fast. However, not all real-time analytics systems are made equal. While they share the same goal in the end, there are differences in how they achieve it. This talk aims to classify real-time analytics systems into four main groups based on five …

Maximizing Query Speed and Minimizing Costs in Data Lakes with Open-Source Caching

by Beinan Wang
As data lakes scale in complexity and size, companies face challenges with slow and inconsistent data access, rapidly growing storage costs, and high operation costs when migrating to the cloud. In this talk, we discuss an open-source caching framework we designed to improve performance by 1.5x and reduce storage costs by millions per year. The framework leverages tools like Hadoop, Parquet, Hudi, …

Maybe The Real Modern Data Stack Was the Open Source Tools We Got Along The Way

by Pedram Navid
There are many misconceptions about the Modern Data Stack, and it’s easy to forget the real pain it solved and the value it unlocked. While some people still view the Modern Data Stack as marketing fluff, I’d like to demonstrate how powerful it can be by reclaiming it using Open Source tooling. With tools like sling, dbt, duckdb, dagster, and more, we can spin up cost-effective, fast, and …

Most "Open Source" AI Isn't. And What We Can Do About That.

by Chris Hazard
What does “Open Source AI” really mean? If you publish the weights for a neural network, is that much different than only publishing an executable binary without the source? What if the model has memorized data or code that it can reproduce without attribution? What if you interrogate a model for why a decision was made, and you get a wrong explanation? How can you debug and fix it, …

Navigating the Landscape of a Fully Open Source Data Stack in 2023

by Maxime Beauchemin

Open Formats: The Happy Accident Disrupting the Data Industry

by Ryan Blue
Analytic databases are quietly going through an unprecedented transformation. Open table formats, led by Apache Iceberg, enable multiple query engines to share one central copy of a table. This will fundamentally change the data industry, by freeing data that’s being held hostage by siloed data vendors. This session will cover the origins and basics of open table formats and show how new …

Open Source BI FTW - Building Compelling Dashboards with Apache Superset

by Evan Rusackas
Open source BI is here, it’s better, it’s cheaper, and it can be everything you need it to be.

Open Source Project Report: Evidence - Business Intelligence as Code

by Sean Hughes
Evidence is an open source business intelligence tool where all content is defined in markdown and SQL. This session will give an overview of the project: what it is, why we’re building it, why we chose open source, and the upcoming roadmap. It will also include a look at the newest release, which introduces a client-side SQL runtime powered by DuckDB WebAssembly, interactive filters, and …

OSA Con 2023 Welcome

by Robert Hodges & Maxime Beauchemin

Panel Discussion on Growing a Healthy Open Source Community

by Ali LeClerc, Nadine Farah & Evan Rusackas
Communities are at the heart of the open source movement. Community members help with everything from code contributions to trying out software to marketing. Plus, good communities are just fun to be around. In this panel discussion our experts will discuss strategies and tactics to build communities that are welcoming to all, promote collaborative work, and help make their open source projects a …

Panel: Open Source means Open! Or Does it? The State of Licensing in 2023

by Heather Meeker, Peter Zaitsev, Roman Shaposhnik, Kenaz Kwa & Robert Hodges
Open source licenses are a linchpin of the free and open source software movement. Join our panel of open source experts as we do our yearly check-in on the state of licenses. We’ll talk about new developments in licensing this year (Terraform, anyone?), revisit what a license actually does for your project, and talk about legal and moral issues when projects relicense. We’ll even take …

Prestissimo: The new generation Presto

by Aditi Pandit
Prestissimo is the latest innovation in the Presto SQL query engine (https://prestodb.io/). It is an ambitious endeavor to replace Presto’s Java-based runtime execution with a new state-of-the-art C++ engine based on the concepts of vectorization and runtime optimization. The native engine has many benefits: a huge performance boost and CPU efficiency thanks to vectorization, SIMD …

Proton: A single binary to tackle streaming and historical analytics

by Ken Chen
Proton is a unified streaming and historical analytics engine built on top of the ClickHouse code base and shipped as a single binary. It is the core engine that powers the Timeplus core product and is open-sourced under the Apache 2.0 license: https://github.com/timeplus-io/proton. In this talk, I will cover its technical internals, like watermarking, streaming query state management, its internal streaming store, …

Query Live Data Using Open Source SQL Engines

by Jove Zhong & Gang Tao
Streaming data is rapidly becoming a key component in modern applications, and Apache Kafka, Redpanda, and Apache Pulsar have emerged as popular and powerful platforms for managing and processing these data streams. However, as the volume and complexity of streaming data continue to grow, it becomes increasingly critical to have efficient and effective ways of querying and analyzing this data. …

QuestDB: The building blocks of a fast open-source time-series database

by Javier Ramirez
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds. It is no wonder time-series databases are now more popular than …

Real-Time Revolution: Kickstarting Your Journey in Streaming Data

by Zander Matheson
Stream processing is hard! It’s expensive! It’s unnecessary! Batch is all you need! It’s hard to maintain! While some of these may sound true, the world of streaming data has come a long way and it is time we start to take advantage of data in real-time. This talk dips your feet into the world of streaming data and demystifies some of the common misconceptions. We will cover some …

Reducing complexity and increasing performance with Trino

by Cole Bowden
In this talk, we’ll provide a quick overview of, and then a longer update on, Trino, the lightning-fast distributed SQL query and federation engine. Trino has been a prominent force in the open source data stack for over a decade, and development on it is as active as ever. It connects to a multitude of data sources and has more clients supporting it than ever before. With exciting features …

Reinventing Kafka in the Data Streaming Era

by Jun Rao
Many enterprises are adopting data streaming platforms to act on what’s happening in the business in real time. Apache Kafka is becoming the standard for building this platform. In this talk, I will first provide an overview of Kafka and its ecosystem: with Kafka as the storage layer, systems like Apache Flink as the real-time processing layer, and an integration layer that …

StarRocks: Fast Real-Time Analytics for User-Facing Applications

by Albert Wong
Real-time analytics is essential for user-facing applications, such as e-commerce websites, social media platforms, and streaming services. These applications need to be able to analyze data in real time to provide users with personalized experiences and make recommendations. StarRocks is a high-performance, distributed analytical database that is optimized for real-time analytics. StarRocks can …

The Future of Analytics is Open Source and Cloud Native

by Robert Hodges
Analytic platforms are like cathedrals for data - and their foundations are cloud native. In this talk we’ll discuss the forces driving open source adoption, how cloud native enables flexible analytic applications, and what it means for systems being designed today.

The Need for an Open Standard for Semantic Layer

by Brian Bickell
Join Brian Bickell, Cube’s VP of Strategy and Alliances, as he proposes an open standard for the semantic layer, uniting BI tools, embedded analytics, and AI agents. Using an open standard will bring together data, enabling data tools to introspect data model definitions and seamlessly interoperate within the data stack. By embracing this new approach, data practitioners will reap the …

Unlocking Advanced Log Analytics With ClickHouse and Kafka

by Arul Jegadish
In the landscape of observability, logs reign as a fundamental pillar. Undoubtedly, they are among the most extensively employed telemetry signals. However, beneath their widespread usage by developers lies a complexity that cannot be ignored - logs are verbose, lack structure, and are hard to search and analyze. The pursuit of advanced analytics on this foundation can lead down a costly and …

Unlocking Financial Data with Real-Time Pipelines

by Timothy Spann
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache …

Unlocking Scalable and Efficient Data Storage with Apache Ozone

by Uma Maheswara Rao Gangumalla
In today’s data-driven world, organizations are faced with unprecedented volumes of data and increasingly complex storage requirements. To address these challenges, Apache Ozone emerges as a game-changing solution, redefining the landscape of distributed object storage systems. Apache Ozone is an open-source, highly scalable, and efficient storage system designed to provide a reliable and …

Unveiling the Power of dbt and DuckDB: Hype vs. Reality

by Cameron Cyr
Data professionals and analysts are constantly searching for efficient ways to streamline their ETL/ELT processes. dbt, with its focus on transformation, modeling, and testing, has gained significant traction in the industry. On the other hand, DuckDB, a high-performance analytical database, has gained recognition for its speed and versatility. In this session, we will examine use cases of …

What the Duck?

by Jordan Tigani
DuckDB is taking the analytics world by storm. This talk will cover what makes DuckDB so ducking awesome. We’ll dig into DuckDB use cases, syntax, connectors, architecture, and features that make it more than just another query engine.

Where the Modern Data Stack has Failed and why Engineering-centric Tools will Reshape the Data World

by Nick Schrock
The “modern data stack” has been a big leap forward for data teams, but it is starting to tear at the seams. It is easy to adopt, but it does not scale for data teams at demanding organizations. Too many tools. Some are too heavy; some are too light. They are not composable or programmable enough, leaving data engineers with an unwieldy set of siloed tools that are difficult to program, deploy, and manage in …

Who needs ChatGPT? Rock solid AI pipelines with Hugging Face and Kedro

by Juan Luis Cano
Artificial Intelligence is all the rage, largely thanks to generative systems like ChatGPT, Midjourney, and the like. These commercial systems are very sophisticated and powerful, but also a bit opaque if you want to learn how they work or adapt them to your needs. What happens inside the ‘black box’? Luckily there are open AI models that you can download comfortably, study without …

You put OLTP in my OLAP! Analytics and Real-time Converged

by Felipe Mendes
Analytics (OLAP) and Real-time (OLTP) workloads serve distinctly different purposes. OLAP is optimized for data analysis and reporting, while OLTP is optimized for real-time low-latency traffic. Most databases are designed to primarily benefit from one of them. Worse, concurrently running both workloads under the same datastore will frequently introduce resource contention, where the workloads end …