
Preamble
As enterprises accelerate their adoption of generative AI (GenAI), selecting an appropriate technical architecture becomes a critical strategic decision. The right choice depends on key considerations, including cost management, data sensitivity, internal capability readiness, and the desired breadth of available AI models. This paper explores three core deployment configurations—fully managed APIs, cloud-hosted virtual private servers, and on-premises private hardware—highlighting the implications of each configuration across these dimensions.
Senior leaders will gain clarity on the practical trade-offs involved, helping to guide informed decisions on architecture, scale, and resource allocation for successful GenAI integration.
We will also cover some of the current hardware and cloud options available on the market.
Note: The technology around LLMs and GenAI is rapidly evolving. This paper reflects our analysis at the time of writing.
Core architecture – Solution Space
Deploying generative AI (GenAI) within enterprises involves selecting an appropriate technical architecture. Three core architectural options are available: accessing GenAI models via APIs, hosting models on virtual private servers (cloud-based), and deploying models on privately owned hardware (on-premise).
Closed models such as OpenAI's GPT series, Claude, Gemini, and Grok cannot be deployed locally. The discussion around deploying LLMs onto a local server or private VM therefore centers on open models such as Llama, DeepSeek, and Mistral.
- API-based Services (Fully Managed)
Enterprises connect to externally hosted GenAI models through application programming interfaces (APIs). This method uses cloud-based services managed entirely by third-party providers, with enterprise applications interacting directly with the external APIs (a minimal sketch of this interaction appears after this list).
- Virtual Private Servers (Cloud-hosted Private Instances)
In this configuration, enterprises deploy GenAI models within dedicated virtual private servers (VPS) provided by cloud vendors. Enterprises manage their own model instances, infrastructure setup, and resource allocation within a cloud-based environment.
- Private Hardware Deployments (On-premise)
Enterprises deploy and operate GenAI models on physical hardware owned and managed within their own data centers or facilities. This configuration includes internal control over all hardware, software, and infrastructure involved.
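As a simple illustration of the API-based pattern, the sketch below shows an enterprise application sending a prompt to an externally hosted model over HTTPS. The endpoint URL, model name, and authentication header are placeholders following a common chat-completions convention, not any specific vendor's documented API.

```python
import json
import os
import urllib.request

# Illustrative API-based call: the enterprise application sends a prompt to an
# externally hosted model. URL, model name, and auth scheme are placeholders.
API_URL = "https://api.example-llm-vendor.com/v1/chat/completions"
API_KEY = os.environ.get("LLM_API_KEY", "")

payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Draft a two-sentence product summary."}],
}
request = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))
```

In the other two configurations, the same application logic would point at a model instance hosted in the enterprise's own VPS or data center rather than a third-party endpoint.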
Considerations for selecting the right architecture
Selecting the optimal architecture for deploying generative AI within an enterprise requires balancing multiple strategic considerations. The following table outlines key factors—including cost, data security, maintenance needs, and model availability—to guide informed decision-making across the three core deployment configurations.
Table 1: Highlight of key factors
| Consideration | API-based Services (Fully Managed) | Virtual Private Servers (Cloud-hosted) | Private Hardware (On-premise) |
|---|---|---|---|
| Cost | Typically lower initial costs (pay-per-use); scales linearly with usage; potentially higher long-term cost | Moderate cost; includes cloud infrastructure and operational expenses; moderate upfront investment | Higher initial investment due to hardware purchase and infrastructure setup; potentially lower ongoing costs at scale |
| Data Security | Data leaves the enterprise environment; security governed by third-party provider policies | Enhanced security within cloud-based private environments; better isolation and control than public APIs | Highest level of data security; data remains fully within enterprise premises |
| Maintenance | Minimal maintenance required; vendor manages infrastructure and model updates | Moderate maintenance; enterprise responsible for model deployment and infrastructure management, but cloud vendor maintains hardware | Highest maintenance burden; enterprise responsible for all hardware, infrastructure, model deployments, and updates |
| Breadth of LLMs | Broadest availability: includes both closed-source proprietary models (e.g., GPT-4, Gemini) via vendor APIs and open-source models via third-party API providers | Limited to open-source models or models with licensed/self-hosted weights (proprietary, closed-source models like GPT-4 or Gemini typically unavailable) | Limited to open-source models or models with licensed/self-hosted weights (proprietary, closed-source models like GPT-4 or Gemini typically unavailable) |
| AI Agent Ecosystem | Embedded into the wider ecosystem | Requires implementation effort | Requires implementation effort |
Total cost of ownership (TCO)
Costs of GenAI projects vary substantially, making it crucial to understand the drivers of TCO. One primary factor is the inference cost associated with the usage of different LLMs.
To illustrate this, we simulated a scenario of an API-based application processing 5 billion input tokens and 2.5 billion output tokens per month, roughly the equivalent of 10 million pages of input text and 5 million pages of output text.
The estimated monthly inference costs vary significantly, from USD 1,350 for the most economical LLM in our simulation (Llama 3 8B) to USD 262,500 for the premium Opus 4 model.
Selecting the right model for each task is therefore a critical competency for effective cost management.
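For readers who want to reproduce the arithmetic behind these estimates, the short sketch below recomputes the two endpoint figures. The per-million-token rates are back-calculated from the totals quoted above (roughly USD 0.18 per million tokens for Llama 3 8B on both input and output, and USD 15 / USD 75 per million input/output tokens for Opus 4); treat them as illustrative assumptions rather than a current price list.

```python
# Monthly inference-cost estimate for the simulated workload described above.
MONTHLY_INPUT_TOKENS = 5_000_000_000    # ~10 million pages of input text
MONTHLY_OUTPUT_TOKENS = 2_500_000_000   # ~5 million pages of output text

# (input USD per 1M tokens, output USD per 1M tokens) -- illustrative rates
MODEL_RATES = {
    "Llama 3 8B": (0.18, 0.18),
    "Opus 4": (15.00, 75.00),
}

def monthly_cost(input_rate: float, output_rate: float) -> float:
    """Estimated monthly inference cost in USD for the simulated token volumes."""
    return (MONTHLY_INPUT_TOKENS / 1_000_000) * input_rate \
         + (MONTHLY_OUTPUT_TOKENS / 1_000_000) * output_rate

for model, (in_rate, out_rate) in MODEL_RATES.items():
    # Prints ~USD 1,350 for the economical model and USD 262,500 for the premium model.
    print(f"{model}: USD {monthly_cost(in_rate, out_rate):,.0f} per month")
```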
Exhibit 1: Total cost of ownership

But the costs of GenAI projects do not depend on model selection alone. Additional complexity arises from the underlying architecture, with each deployment configuration introducing distinct cost considerations that impact the TCO:
- API-based Services: Simplest; clearly defined pricing structure; minimal complexity.
- Virtual Private Servers (Cloud-hosted): Moderate complexity; clearly defined cloud infrastructure costs and manageable operational overhead.
- Private Hardware (On-premise): Highest complexity; multiple cost dimensions (hardware, infrastructure, operational staff, facility overhead, licensing) requiring careful estimation and long-term planning.
The broad categories of cost across the three solution configurations are highlighted below.
Table 2: Categories of cost
| Cost Component | API-based Services (Fully Managed) | Virtual Private Servers (Cloud-hosted) | Private Hardware (On-premise) |
|---|---|---|---|
| Initial Investment Costs | Minimal setup fees, initial integration costs | Cloud infrastructure setup, configuration, and initial integration | Hardware procurement (GPUs/CPUs, servers, networking, storage), facility setup costs |
| Infrastructure Costs (Ongoing) | Not directly (included in usage fees) | Cloud service fees (compute, storage, bandwidth, backups) | Data center operating costs, electricity, cooling, and real estate |
| Usage Costs (Model Inference) | Pay-per-token/API-call fees | Cloud compute instances (hourly or monthly rates) | Hardware depreciation, power consumption, and compute resource allocation costs |
| Model Licensing Costs | Included in API fees | Possible licensing for certain premium models | Possible licensing for certain premium models or commercial open-source support agreements |
| Model Training/Fine-tuning Costs | Typically pay-as-you-go or subscription fees for fine-tuning services | Compute costs of training instances in the cloud, storage fees | Hardware use for training, electricity, and specialized storage infrastructure |
| Software Infrastructure Costs | Minimal (often part of API services or enterprise stack) | Operating system licenses, container/orchestration platform licenses (if applicable), software management tools | Operating system licenses, management software licenses (VMware, Kubernetes/OpenShift, security platforms), monitoring tools |
| Security & Compliance Costs | Basic compliance (often included), optional add-on services | Cloud security management tools, compliance management services, and access control services | On-prem security systems, network security, compliance audits, certifications |
| Data Transfer & Networking Costs | API request data charges, usually minimal | Cloud network egress charges, inter-region transfer fees | On-prem network infrastructure maintenance, ISP connectivity, dedicated fiber/network management |
| Disaster Recovery & Backup Costs | Typically included or minimal add-on fee | Cloud-based backup and disaster recovery fees | Backup infrastructure, disaster recovery planning, and off-site backups |
Other Considerations
Data Security Considerations
While the majority of enterprises manage some sensitive data, certain sectors face distinct privacy concerns with generative AI, for example:
- Retail: Retailers face critical concerns around protecting customer data, especially when using AI for analysis of customer demographics, segmentation, and personalization. If customer data cannot be sufficiently masked in a cloud environment, retailers should take steps to ensure the data is securely housed.
- Healthcare: Major concerns include accidental disclosure of protected health information (PHI), compliance with strict regulations like HIPAA, and ensuring accuracy to avoid harmful errors.
- Finance: Financial institutions emphasize safeguarding sensitive client and proprietary information, regulatory compliance, and preventing data leaks or misuse.
- Government: Public-sector agencies prioritize data sovereignty, national security, compliance with stringent privacy laws, and preventing inadvertent exposure of confidential citizen data.
- Legal: Law firms face critical concerns around maintaining attorney-client confidentiality, protecting privileged case information, complying with professional ethics standards, and avoiding accidental disclosures via external AI services.
As data regulatory requirements and the cost of non-compliance across regions are likely to increase, it will become more important for companies to disclose to boards, shareholders, and customers where data is being stored. The decision of where data is stored and who has access will therefore become a key decision point for the GenAI architecture chosen.
Currently, data security considerations encompass LLM security, since the principal risk with LLMs is how data may be used for training purposes. However, as LLMs become more prominent, regulations specific to LLMs are likely to emerge.
In summary, the enterprise's architecture, security, and risk teams will need to weigh these factors before deciding on an architecture.
Team Capabilities Considerations
Deploying LLMs on-premises grants organizations greater control and data security but requires mastering a distinct set of capabilities compared to API-based services.
Regardless of the chosen architecture, foundational AI capabilities must be developed when scaling GenAI applications. These capabilities are labeled as GenOps and involve the integration and optimization of GenAI application components such as prompt engineering, guardrails and cost analytics.
Choosing an on-premises architecture further requires the development of capabilities known as LLMOps. These skills cover distinct capabilities including advanced model serving, GPU infrastructure management, and infrastructure optimization. Critically, LLMOps capabilities are not just extensions of GenOps skills: LLMOps demands deep expertise that sits closer to platform teams, whereas GenOps aligns more closely with application engineers. As a consequence, organizations that decide to run open-source models on-premises must develop an additional AI capability and team.
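As a concrete illustration of the model-serving skill set that LLMOps entails, the sketch below shows how an open-weight model might be served on-premises with an open-source inference library such as vLLM. The library choice, model identifier, and sampling parameters are assumptions chosen for illustration, not a prescribed stack.

```python
# Minimal on-premises model-serving sketch using vLLM (assumed stack).
# Assumes vLLM is installed, a suitable GPU is available, and the model
# weights (here an open Llama 3 8B instruct variant) are accessible locally.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")   # load onto local GPU(s)
params = SamplingParams(temperature=0.2, max_tokens=256)  # tuned per use case

outputs = llm.generate(["Summarize our data-retention policy in one paragraph."], params)
for output in outputs:
    print(output.outputs[0].text)
```

In practice, LLMOps teams wrap this serving layer with monitoring, scaling, and GPU capacity management, which is what distinguishes it from application-level GenOps work.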
The figure below expands on the capabilities needed for both GenOps and LLMOps.
Exhibit 2: AI Engineering Capabilities

LLM Model Availability Considerations
The selection of available models for on-premises deployments is generally more limited compared to cloud-based API solutions. While cloud API providers offer immediate access to a wide array of both proprietary models (such as OpenAI’s GPT series or Google’s Gemini) and open-source models through third-party API services, on-premises deployments typically rely solely on open-source or commercially licensed models with publicly available weights. Proprietary models, widely regarded as state-of-the-art, usually cannot be hosted internally due to licensing restrictions and providers’ policies. Consequently, enterprises adopting an on-premises approach might face a narrower range of model choices, potentially affecting flexibility and capability.
However, the LLM market is changing extremely fast; with new open-source models being released and existing models improving, this gap may narrow over time. With the right architecture, models can be swapped or updated without re-implementation.
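One common way to preserve that flexibility is to place a thin abstraction layer between applications and the underlying model, so that swapping or upgrading a model becomes a configuration change rather than a re-implementation. The sketch below is a minimal, hypothetical illustration of such a layer; the class names and the two backends shown are assumptions, not any specific product's API.

```python
from abc import ABC, abstractmethod

class TextModel(ABC):
    """Interface the application codes against, independent of the model backend."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class HostedApiModel(TextModel):
    """Backend for a proprietary model reached via a vendor API."""
    def __init__(self, endpoint: str, api_key: str):
        self.endpoint, self.api_key = endpoint, api_key
    def generate(self, prompt: str) -> str:
        raise NotImplementedError("wire up the vendor API call here")

class LocalOpenModel(TextModel):
    """Backend for an open-weight model served on a VPS or on-premises."""
    def __init__(self, model_path: str):
        self.model_path = model_path
    def generate(self, prompt: str) -> str:
        raise NotImplementedError("wire up the local inference server here")

def load_model(config: dict) -> TextModel:
    """Choose the backend from configuration so models can be swapped without code changes."""
    if config["backend"] == "api":
        return HostedApiModel(config["endpoint"], config["api_key"])
    return LocalOpenModel(config["model_path"])
```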
The following tables cover three types of models: highly efficient models, mid-range models, and small language models. The parameters column indicates the size of the model, the developer is the organization that released the initial model, and the key features column highlights what the model is notable for.
Table 3: Highly Efficient Models (Best for Small Servers)
| Model | Parameters | Developer | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 billion | Meta | Compact model balancing performance and resource usage |
| Mistral 7B | 7 billion | Mistral AI | Excellent performance-to-size ratio, good for coding tasks |
| Phi-3 Mini | 3.8 billion | Microsoft | Small but capable model |
| TinyLlama | 1.1 billion | Various contributors | Extremely lightweight option for basic tasks |
| Gemma | 2B/7B | Google | Efficient models with good instruction following |
Table 4: Mid-Range Models (Moderate Small Server Requirements)
| Model | Parameters | Developer | Key Features |
|---|---|---|---|
| Llama 3 70B | 70 billion | Meta | Larger model with stronger capabilities |
| Mixtral 8x7B | ~45 billion effective | Mistral AI | Mixture-of-experts architecture with strong performance |
| Phi-3 Mini | 13 billion | Microsoft | Fine-tuned version with better instruction following |
| TinyLlama | 40 billion | Various contributors | Powerful open model with broad knowledge |
| Gemma | 7B/13B | Google | Specialized for programming tasks |
Small Language Models (SLMs)
Small language models (SLMs) are gaining traction for specific use cases. They are easier to deploy and work well for narrowly scoped tasks. As they advance, they may become particularly well suited to deployment on edge devices or for very specific use cases.
Table 5: Small Language Models (SLMs)
| Model | Developer | Key Features |
|---|---|---|
| MobileBERT | Google, Carnegie Mellon University | A compressed BERT model for mobile devices |
| DistilBERT | Hugging Face | A lighter, faster version of BERT with 40% size reduction |
| BERT-Tiny | Google | An extremely small variant with only 4.4M parameters |
| GPT-2 Small | OpenAI | The smallest GPT-2 variant at 124M parameters |
| FLAN-T5 Small | Google | Google's compact instruction-tuned model |
The models listed above reflect the landscape at the time of writing. This is a rapidly evolving area and needs to be monitored closely.
Model selection and infrastructure requirements
LLM selection directly impacts the infrastructure requirements and costs. Infrastructure costs and performance are dependent on a set of diverse parameters, such as GPU memory, computational power (Tflops) and memory bandwidth. While a comprehensive calculation of these costs goes beyond this paper’s scope, we will briefly discuss how LLM choice affects GPU memory requirements, as this is one of the primary drivers of infrastructure cost.
There is a simple rule of thumb to calculate GPU RAM needs based on the number of parameters of the LLM: for inference, 2 bytes of RAM are needed for each parameter (i.e., 16-bit weights), with an additional 20% for overhead. Let's apply this to the smallest and largest Llama models available:
- Llama 3.1 405B requires an investment of roughly USD 280,000 in GPUs. The model needs 972 GB of RAM. An 8x NVIDIA H200 GPU setup (141 GB each) provides sufficient capacity to run inference on the model. At a cost of USD 35,000 per GPU, this results in USD 280,000.
- TinyLlama requires a GPU investment of only about USD 300. With its 1.1B parameters it needs 2.64 GB of RAM. The NVIDIA RTX 4060, equipped with 8 GB of RAM, comfortably supports this requirement at a price of roughly USD 300.
This comparison underscores the impact that LLM selection has on infrastructure and cost. When organizations decide to deploy an on-premises GenAI infrastructure, it is crucial to identify the LLMs that match their needs while controlling cost.
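The rule of thumb above is easy to codify. The sketch below reproduces both figures; the GPU memory sizes come from the examples in the text, and the 8-GPU configuration described above is the setup used in the example rather than the bare minimum the arithmetic suggests.

```python
import math

def inference_memory_gb(parameters_billions: float, overhead: float = 0.20) -> float:
    """Rule of thumb: 2 bytes per parameter (16-bit weights) plus 20% overhead."""
    return parameters_billions * 2 * (1 + overhead)

def gpus_needed(parameters_billions: float, gpu_memory_gb: float) -> int:
    """Smallest GPU count whose combined memory covers the estimate."""
    return math.ceil(inference_memory_gb(parameters_billions) / gpu_memory_gb)

print(inference_memory_gb(405))   # ~972 GB for Llama 3.1 405B
print(gpus_needed(405, 141))      # 7 H200-class GPUs by this rule; the example above uses an 8-GPU setup
print(inference_memory_gb(1.1))   # ~2.64 GB for TinyLlama, within an 8 GB RTX 4060
```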
Getting started – from pilot to scale
Many groups start on the road to GenAI without a clear end goal. They are either in exploratory mode, have a concept of a business case, or are using readily available solutions for corporate consumption, but want to "see how it goes" before committing to a full GenAI investment.
Even if the group is in "see how it goes" mode, it should understand the roadmap to a scaled-out GenAI architecture, as this will shape the first investment and approach.
The simplest place to start is either a cloud partner or, if data is considered highly sensitive, an on-premise hardware option.
For cloud-based deployments, procuring and implementing the use case is a relatively straightforward process.
For the hardware option, future growth in usage should be taken into consideration. Plan for possible growth from a single POC machine to a scaled-out three-tier architecture if required (GPU server, application server, data server).
To give an example of a vendor approach, let us examine the Oracle LLM solution, its cloud options, and its on-premise hardware options.
The following details the LLM model process available in the Oracle OCI cloud.
Exhibit 3: LLM model process available in the Oracle OCI cloud

Exhibit 4 : LLM model cloud options

In terms of on-premise options, Oracle has a private cloud option covering the application server, the data server, and the GPU server.
Exhibit 5 : Oracle AI Optimized Infrastructure Architecture

Some Callouts:
- Supportability: If an on-premise deployment model is chosen, consideration should be given to the support model, i.e., LLMOps concerns such as resourcing, latency, and upgrades.
- GenOps considerations: No matter which option is chosen, the application layer should adopt GenOps practices to manage the application. This provides a solid foundation and management discipline for future growth.
- Security/Privacy: In highly regulated sectors such as healthcare and legal, the enterprise should start building an LLMOps team, with the option of starting with an on-premise deployment. This allows the team to architect the deployment so that operations can scale while preserving data security.
Use case evaluator
There are three main categories of use cases to consider when deciding which type of LLM to use.
- Category 1 – Using LLMs to interact with data: These use cases employ the LLM purely for interacting with data. Such LLMs can be pre-trained and do not require a large number of parameters. They can be deployed locally using an open model on a small machine at a specific premise; the machine can be scaled to include multiple CPUs/GPUs, or the model can be deployed onto a private VM. This is a good option for deployment at a premise where data security is a primary concern.

- Category 2 – Using LLMs to enhance context before interacting with data: These use cases make API calls to larger LLMs to obtain better context and then interact with the data. They suit situations where the parameters are broader and the interactions with users are not simple. The privacy concern needs to be addressed by ensuring that no sensitive data is sent via the API call (a minimal redaction sketch follows this list); the API call should be purely for enhancing the context, not for interacting with sensitive data.

- Category 3 – AI agents using LLMs to interact with data: This category is similar to Category 2; however, the users or enterprise applications interact with AI agents. The AI agents determine the course of action, which can include calling an LLM to enhance context prior to interacting with the data.
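To make the Category 2 privacy constraint concrete, the sketch below shows one hypothetical way to strip sensitive values from a prompt before it leaves the secured network. The patterns, placeholder tokens, and example fields are illustrative assumptions, not a complete data-loss-prevention solution.

```python
import re

# Hypothetical redaction step applied before any external API call (Category 2).
# Real deployments would rely on a vetted PII/PHI detection service.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ACCOUNT_ID": re.compile(r"\bACCT-\d{6,}\b"),
}

def redact(text: str) -> str:
    """Replace sensitive values with placeholder tokens before the prompt leaves the network."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Summarize the complaint from jane.doe@example.com about account ACCT-123456."
safe_prompt = redact(prompt)   # sent externally for context enhancement only
print(safe_prompt)             # the unredacted data never leaves the secured network
```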

In all three categories, the sensitive data itself resides within the secured network, and this data is not used for training the models.
The first category is simpler and cheaper to start with but suits specific use cases. Training is also done for specific use cases, so the number of parameters the LLM is built on stays limited. These models can nonetheless work very well for the use cases they are trained on.
The second category combines a smaller, localized LLM deployed within the network with a larger LLM accessed via API call for enhancing the context. The cost is higher and is driven by the number of users, API calls, and so on. This approach also provides considerable flexibility to enhance simpler use cases as the user base and complexity grow.
Summary
The selection of an optimal GenAI deployment architecture represents a strategic investment that balances immediate operational needs with long-term digital transformation objectives. By evaluating the distinct advantages and limitations of managed API services, cloud-hosted virtual private servers, and on-premises hardware solutions, organizations can align their infrastructure choices with their specific requirements for cost efficiency, data governance, organizational capabilities, and model flexibility. As the GenAI landscape continues to evolve, decision-makers armed with a clear understanding of these architectural trade-offs will be better positioned to make informed investments in infrastructure, talent, and partnerships.
This strategic alignment ensures that AI implementation not only addresses current business challenges but also establishes a scalable foundation that can adapt to emerging technologies and changing market demands. The hardware and cloud options detailed in this paper provide a starting point for this journey, offering practical pathways to successful enterprise AI integration regardless of organizational size or technical maturity.
About the Authors
For information or permission to reprint, please contact READY at [email protected] or ARCFUSION at [email protected]
To find the latest READY and ARCFUSION updates and content visit us at readyms.com and archfusion.ai.
Follow READY and ARCFUSION on LinkedIn.
© Ready Management Solutions 2025. All rights reserved.
© ArcFusion 2025. All rights reserved.
About Ready
Ready is a consulting agency committed to providing innovative solutions to address operational and technological needs. With a focus on strategy, automation, and enablement, Ready specializes in offering forward-looking solutions for the modern customer. With operations in the United States, Philippines, Australia, and Thailand, and plans to expand further, Ready is set to become a global force in the consulting world.