Introducing AWS Glue 5.0 for Apache Spark

AWS Glue is a serverless, scalable data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources. Today, we are launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and … Read more

Use open table format libraries on AWS Glue 5.0 for Apache Spark

Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. These formats, exemplified by Apache Iceberg, Apache Hudi, and Delta Lake, address persistent challenges in traditional data lake structures by offering an advanced combination of flexibility, performance, and governance capabilities. By providing … Read more

Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation

AWS Glue 5.0 supports fine-grained access control (FGAC) based on your policies defined in AWS Lake Formation. FGAC enables you to granularly control access to your data lake resources at the table, column, and row levels. This level of control is essential for organizations that need to comply with data governance and security regulations, or … Read more

Simplify data access for your enterprise using Amazon SageMaker Lakehouse

Organizations are increasingly using data to make decisions and drive innovation. However, building data-driven applications can be challenging. It often requires multiple teams working together and integrating various data sources, tools, and services. For example, creating a targeted marketing app involves data engineers, data scientists, and business analysts using different systems and tools. This complexity … Read more

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

In today’s data-driven world, tracking and analyzing changes over time has become essential. As organizations process vast amounts of data, maintaining an accurate historical record is crucial. History management in data systems is fundamental for compliance, business intelligence, data quality, and time-based analysis. It enables organizations to maintain audit trails, perform trend analysis, identify data quality … Read more

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

Given the importance of data in the world today, organizations face the dual challenges of managing large-scale, continuously incoming data while vetting its quality and reliability. The importance of publishing only high-quality data can’t be overstated—it’s the foundation for accurate analytics, reliable machine learning (ML) models, and sound decision-making. Equally crucial is the ability to … Read more

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. With its massively parallel processing (MPP) architecture and columnar data storage, Amazon Redshift delivers high price-performance for complex analytical queries against large datasets. To interact with and analyze data stored in Amazon Redshift, … Read more

Run Apache Spark Structured Streaming jobs at scale on Amazon EMR Serverless

As data is generated at an unprecedented rate, streaming solutions have become essential for businesses seeking to harness near real-time insights. Streaming data—from social media feeds, IoT devices, e-commerce transactions, and more—requires robust platforms that can process and analyze data as it arrives, enabling immediate decision-making and actions. This is where Apache Spark Structured Streaming … Read more