Big Data

Introducing AWS Glue 5.0 for Apache Spark

December 11, 2024 by Dutd

AWS Glue is a serverless, scalable data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources. Today, we are launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and … Read more

Use open table format libraries on AWS Glue 5.0 for Apache Spark

December 11, 2024 by Dutd

Open table formats are emerging in the rapidly evolving domain of big data management, fundamentally altering the landscape of data storage and analysis. These formats, exemplified by Apache Iceberg, Apache Hudi, and Delta Lake, address persistent challenges in traditional data lake structures by offering an advanced combination of flexibility, performance, and governance capabilities. By providing … Read more

Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation

December 11, 2024 by Dutd

AWS Glue 5.0 supports fine-grained access control (FGAC) based on your policies defined in AWS Lake Formation. FGAC enables you to granularly control access to your data lake resources at the table, column, and row levels. This level of control is essential for organizations that need to comply with data governance and security regulations, or … Read more

Simplify data access for your enterprise using Amazon SageMaker Lakehouse

December 11, 2024 by Dutd

Organizations are increasingly using data to make decisions and drive innovation. However, building data-driven applications can be challenging. It often requires multiple teams working together and integrating various data sources, tools, and services. For example, creating a targeted marketing app involves data engineers, data scientists, and business analysts using different systems and tools. This complexity … Read more

How REA Group approaches Amazon MSK cluster capacity planning

December 11, 2024 by Dutd

This post was written by Eunice Aguilar and Francisco Rodera from REA Group. Enterprises that need to share and access large amounts of data across multiple domains and services need to build a cloud infrastructure that scales as needs change. REA Group, a digital business that specializes in real estate property, solved this problem using … Read more

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

December 11, 2024 by Dutd

In today’s data-driven world, tracking and analyzing changes over time has become essential. As organizations process vast amounts of data, maintaining an accurate historical record is crucial. History management in data systems is fundamental for compliance, business intelligence, data quality, and time-based analysis. It enables organizations to maintain audit trails, perform trend analysis, identify data quality … Read more

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

December 11, 2024 by Dutd

Given the importance of data in the world today, organizations face the dual challenges of managing large-scale, continuously incoming data while vetting its quality and reliability. The importance of publishing only high-quality data can’t be overstated—it’s the foundation for accurate analytics, reliable machine learning (ML) models, and sound decision-making. Equally crucial is the ability to … Read more

Federate to Amazon Redshift Query Editor v2 with Microsoft Entra ID

December 11, 2024 by Dutd

Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. With its massively parallel processing (MPP) architecture and columnar data storage, Amazon Redshift delivers high price-performance for complex analytical queries against large datasets. To interact with and analyze data stored in Amazon Redshift, … Read more

Run Apache Spark Structured Streaming jobs at scale on Amazon EMR Serverless

December 11, 2024 by Dutd

As data is generated at an unprecedented rate, streaming solutions have become essential for businesses seeking to harness near real-time insights. Streaming data—from social media feeds, IoT devices, e-commerce transactions, and more—requires robust platforms that can process and analyze data as it arrives, enabling immediate decision-making and actions. This is where Apache Spark Structured Streaming … Read more