Pairing On-premises Storage with AWS Outposts for AI

From self-driving cars and financial transactions to transit systems, many use cases require the ability to process data locally in near real-time. Because speed is critical, there’s no time to transfer all that data back to the public cloud.

As a result, many organizations are extending their public-cloud projects to private, on-premises locations and building a hybrid-cloud solution.

AWS Outposts enables you to meet local data-processing or low-latency requirements. With an Outpost rack, you can deploy Amazon Web Services compute hardware in your data center.

Read: Introducing FlashArray for AWS Outposts

Hybrid-cloud Models for AI Deployments

As data-science teams ramp up their models to production, they have to support larger-scale data ingestion and data processing as part of inference.

Often the performance penalty (or cost) of transferring large data sets into the public cloud is prohibitive. In addition, the connection from edge to public cloud may be unreliable. Because of these limitations, teams with massive data ingest are relocating AI inference from the public cloud to edge data centers.
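To put the transfer penalty in rough numbers, the sketch below estimates how long a dataset takes to cross a network link. The dataset size, link speed, and efficiency factor are illustrative assumptions, not measurements from any specific deployment.

```python
def transfer_hours(dataset_bytes: float,
                   link_bits_per_sec: float,
                   efficiency: float = 1.0) -> float:
    """Estimate the hours needed to move dataset_bytes over a link.

    efficiency is the fraction of nominal bandwidth actually achieved
    (0 < efficiency <= 1); real links rarely sustain full line rate.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    TB = 10**12    # decimal terabyte, in bytes
    GBPS = 10**9   # gigabit per second, in bits per second
    # Hypothetical case: 2 TB over a 1 Gbps link at full line rate.
    print(f"{transfer_hours(2 * TB, 1 * GBPS):.1f} hours")
```

Under these assumptions a single 2 TB upload occupies the link for about 4.4 hours, and the figure scales linearly with the number of data sources sharing that link.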

Note: In this case, we use “edge” to mean the distributed computing model where service delivery is performed at multiple edge sites such as colocation facilities.

Teams may centralize model development in the public cloud and then deploy finalized models into edge data centers for inference. Teams can perform initial analytics at the edge to identify anomalies or interesting data points to send back to the public cloud for further analysis or retraining.

Example hybrid-cloud deployment.

For example, an autonomous-vehicle company might have a fleet of cars that each generate 2TB of log data per day. Yet its computer-vision training datasets might not have enough samples of a specific signpost variant (or of other things the cars might cross paths with). So the company’s data scientists might perform inference at the edge location to pull out the data points that contain these high-value anomalies, which they can then feed into model retraining.
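The selection step in this example might look like the following minimal sketch. It assumes each inference result carries a label and a confidence score, and it treats detections that are either on a watch list of rare labels or below a confidence threshold as worth sending back for retraining. The record fields, label names, and threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Detection:
    """One inference result for a single frame (hypothetical schema)."""
    frame_id: str
    label: str
    confidence: float


def select_for_retraining(detections: Iterable[Detection],
                          rare_labels: Set[str],
                          max_confidence: float = 0.6) -> List[Detection]:
    """Keep detections that are rare or that the model is unsure about."""
    return [
        d for d in detections
        if d.label in rare_labels or d.confidence < max_confidence
    ]
```

Only the frames returned by `select_for_retraining` would need to travel to the public cloud; the rest of the data stays on local storage.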

The full datasets are ingested into local servers, and AWS Outposts—maybe even a GPU Outpost configuration—can be used as the edge compute for inference.

Access Data on Local Servers from Outpost EC2 Instances

A senior solutions architect at AWS, Josh Coen, wrote a great article that highlights the simplicity of the networking paths between an Outpost and an adjacent storage device.

After setting up connectivity, it’s easy to start using those external datasets. EC2 instances inside an Outpost can interact with on-prem file and object data.

Prerequisites:

Outpost configuration: Launch an EC2 instance within the Outpost via the AWS Outposts Console. Verify that the instance can ping devices on your local network.
FlashBlade® configuration: Create a subnet, a data VIP, and a filesystem for file testing. Create an Object Store user and a bucket for object testing.

Connect to File Storage

Use the data VIP to mount the filesystem to a directory inside the instance. Example:

mkdir -p /mnt/datasets  
mount -t nfs 10.21.239.11:/datasets /mnt/datasets

That’s it!

Confirm via 'ls' that the contents of the mounted filesystem appear as expected from within the EC2 instance.
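The same check can be scripted, which is handy when several instances need to be verified. This is a minimal sketch that uses only the Python standard library; the function name is hypothetical, and `/mnt/datasets` in the usage note is the mount point from the example above.

```python
import os
from pathlib import Path


def describe_dataset_dir(path: str, limit: int = 10) -> dict:
    """Report whether path is a mount point and list a few of its entries."""
    directory = Path(path)
    if not directory.is_dir():
        raise NotADirectoryError(path)
    entries = sorted(entry.name for entry in directory.iterdir())
    return {
        "is_mount": os.path.ismount(path),  # True if a filesystem is mounted here
        "entry_count": len(entries),
        "sample": entries[:limit],
    }
```

Calling `describe_dataset_dir("/mnt/datasets")` on the instance should report `is_mount` as `True` and show the expected file names in `sample`.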

Connect to Object Storage

Like the process for in-Region instances, an IAM role with Amazon S3 access must be associated with the instance before users can access S3. (For more information, refer to the AWS Command Line Interface User Guide.)

Use the 'aws configure' command to add the Access Key ID and Secret Access Key for the FlashBlade object storage user.

That’s it!

Use the 'aws s3 ls' command to verify that the EC2 instance has access to buckets on the FlashBlade. Because the system is using a custom endpoint instead of the default Amazon S3 endpoint, specify an '--endpoint-url' that points to the previously created data VIP on FlashBlade.

aws --endpoint-url https://10.21.239.11 s3 ls

At this point, the Outpost EC2 instance is ready to consume both file and object data stored on FlashBlade.
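Applications on the instance can reach the same buckets programmatically. The sketch below assumes the third-party boto3 package is installed; the helper names are hypothetical, and the `verify_tls` switch is included only because an on-prem endpoint may present a certificate that the instance does not trust by default.

```python
def flashblade_s3_kwargs(data_vip: str,
                         access_key: str,
                         secret_key: str,
                         use_https: bool = True,
                         verify_tls: bool = True) -> dict:
    """Build keyword arguments for boto3.client() that target a custom endpoint."""
    scheme = "https" if use_https else "http"
    return {
        "service_name": "s3",
        "endpoint_url": f"{scheme}://{data_vip}",
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "verify": verify_tls,
    }


def list_bucket_names(**client_kwargs) -> list:
    """Return the bucket names visible to the configured object store user."""
    import boto3  # third-party; imported here so the helper above has no dependencies

    client = boto3.client(**client_kwargs)
    return [bucket["Name"] for bucket in client.list_buckets()["Buckets"]]
```

For example, `list_bucket_names(**flashblade_s3_kwargs("10.21.239.11", access_key, secret_key))` is the programmatic counterpart of the 'aws s3 ls' check.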

Access Local Data Sets from EMR Clusters inside the Outpost

Running inference directly on raw data at the edge is often not enough. Organizations frequently preprocess or otherwise transform the data before running inference. You can use a service like Amazon EMR to perform these preprocessing steps inside the Outpost.

Example on-prem inference pipeline. Both file and object protocol workloads are supported.

To simplify infrastructure, utilize the same shared storage across the entire edge analytics pipeline.

Deployment

Note: For instructions on launching Amazon EMR clusters into an Outpost, see my post “Create Amazon EMR Clusters on AWS Outposts.”

Once you have the cluster instance(s) running, it’s easy for cluster workloads to access data on the local storage server. To access objects on FlashBlade S3 from this cluster, submit a job with the FlashBlade endpoint and credentials included.

Example:

spark-submit --conf spark.hadoop.fs.s3a.endpoint=https://10.21.239.11 \
             --conf "spark.hadoop.fs.s3a.access.key=#########################################" \
             --conf "spark.hadoop.fs.s3a.secret.key=#########################################" \
             --master yarn \
             --deploy-mode cluster \
             wordcount.py \
             s3a://emily-outpost-bucket/sample.txt
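The wordcount.py script itself is not listed in this post. A minimal sketch of what such a job could look like follows; it is modeled on the common PySpark word-count pattern and is an assumption, not the script used here. It needs the pyspark package, which a Spark-enabled cluster provides.

```python
import sys
from operator import add


def split_words(line: str) -> list:
    """Split one line of text into whitespace-separated words."""
    return line.split()


def main(path: str) -> None:
    # Imported here so split_words can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    # spark.read.text yields one row per line, with the text in the first column.
    lines = spark.read.text(path).rdd.map(lambda row: row[0])
    counts = (lines.flatMap(split_words)
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print(f"{word}: {count}")
    spark.stop()


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

The job takes the s3a:// path as its only argument, so the S3A endpoint and credentials come entirely from the Spark configuration rather than from the script.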

Alternatively, add the FlashBlade settings to the spark-defaults.conf file so that jobs use them by default.

vi /etc/spark/conf.dist/spark-defaults.conf

Add the following lines to the bottom of the file:

# YOUR FLASHBLADE DATA VIP
spark.hadoop.fs.s3a.endpoint 10.21.239.11
spark.hadoop.fs.s3a.access.key=#########################################
spark.hadoop.fs.s3a.secret.key=#########################################


# Suggested tuning for FlashBlade performance.
spark.hadoop.fs.s3a.fast.upload true
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.input.fileinputformat.split.minsize 541073408

You can now submit jobs without specifying the FlashBlade settings inline:

spark-submit --master yarn \
             --deploy-mode cluster \
             wordcount.py \
             s3a://emily-outpost-bucket/sample.txt
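A third option is to set the same S3A properties inside the application when it creates its Spark session. The sketch below reads the credentials from environment variables so that they stay out of scripts and shell history. The variable names `FLASHBLADE_ACCESS_KEY` and `FLASHBLADE_SECRET_KEY` and both helper functions are hypothetical, and pyspark is assumed to be available on the cluster.

```python
import os
from typing import Mapping


def flashblade_s3a_conf(endpoint: str,
                        env: Mapping[str, str] = os.environ) -> dict:
    """Collect the S3A settings for a FlashBlade data VIP as Spark properties."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": env["FLASHBLADE_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["FLASHBLADE_SECRET_KEY"],
    }


def build_session(app_name: str, conf: Mapping[str, str]):
    """Create a SparkSession with the given properties applied."""
    from pyspark.sql import SparkSession  # provided by the Spark installation

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A job would call `build_session("wordcount", flashblade_s3a_conf("10.21.239.11"))` before reading any s3a:// paths. Properties set this way apply only to that application, which can be useful when different jobs on one cluster target different endpoints.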

Takeaways

Hybrid-cloud deployments are becoming more common, especially to support edge analytics. Using FlashBlade as a storage server, we demonstrated the quick steps to use local file and object storage with EC2 instances on AWS Outposts. This enables data scientists to deploy edge inference pipelines that can consume large datasets and perform analytics locally.

Using performant local storage with Outposts helps eliminate the latency of sending analytics queries to the public cloud and the transfer time incurred by pushing all edge data to the public cloud. (And it might even help in complying with data-governance mandates.) By pairing simple-to-manage compute and storage infrastructures, you can make hybrid-cloud deployments easier.

Latest Enhancements: FlashArray™ for AWS Outposts (September 2024 Update)

Since the original discussion on pairing on-premises storage with AWS Outposts, Pure Storage has significantly expanded its hybrid cloud capabilities. The September 2024 release of FlashArray™ for AWS Outposts introduces new levels of performance, scalability, and enterprise-grade resilience to AWS hybrid cloud deployments.

1. Enterprise-Grade Storage Now Fully Integrated with AWS Outposts

FlashArray™ is now available as a fully supported storage solution for AWS Outposts, allowing enterprises to deploy low-latency, high-performance storage alongside their on-premises and cloud-native AWS workloads.
This integration ensures consistent performance, security, and manageability for mission-critical applications running across hybrid environments.

2. Seamless Data Mobility Between On-Prem and AWS

FlashArray™ for AWS Outposts enables seamless data movement between on-premises FlashArrays, AWS Outposts, and AWS regions.
Organizations can now leverage Pure Cloud Block Store™ alongside AWS services, ensuring high availability, disaster recovery, and multi-site replication for critical applications.

3. Advanced AI and ML Workloads with Hybrid Storage

With AI and machine learning adoption on the rise, FlashArray’s ultra-low latency and high IOPS capabilities help accelerate AI model training and inference in AWS Outposts environments.
Enterprises can now keep high-performance AI workloads on-premises while utilizing AWS cloud services for model deployment and analytics, reducing overall latency and data egress costs.

4. Unified Hybrid Management with Pure1® Integration

FlashArray for AWS Outposts is fully integrated with Pure1®, offering AI-driven analytics, predictive monitoring, and automated storage management across on-prem, AWS Outposts, and the public cloud.
IT teams can optimize storage utilization, automate workload placement, and proactively detect performance bottlenecks across their hybrid infrastructure.

5. Enhanced Data Protection and Ransomware Recovery

Pure Storage’s SafeMode™ snapshots are now available for AWS Outposts deployments, ensuring immutable, air-gapped backups to protect against ransomware and cyber threats.
Organizations can implement multi-layered data protection strategies, combining FlashArray snapshots with AWS security tools for comprehensive data resilience.

Why These Updates Matter

The introduction of FlashArray™ for AWS Outposts represents a significant leap in hybrid cloud storage solutions, making it easier for enterprises to build high-performance, scalable, and resilient hybrid infrastructures. These updates provide businesses with greater flexibility, reduced data latency, and enhanced security, ensuring seamless storage operations across on-premises and AWS environments.

For more details, visit the FlashArray for AWS Outposts page.

Learn more about Pure Storage hybrid-cloud solutions and Pure’s partnership with AWS.

To get started right away, contact aws@purestorage.com.
