
Edge AI using LLMs
Active Project · Machine Learning
This project explores the deployment of Large Language Models (LLMs) on the NVIDIA Jetson Orin Nano Super — a powerful yet compact edge computing platform. The goal is to assess the performance, latency, and real-world usability of AI inference workloads in constrained environments. It combines hardware acceleration with software optimization to demonstrate intelligent edge solutions capable of processing natural language locally, without relying on cloud infrastructure. This research provides foundational insight for building secure, low-latency AI systems for robotics, IoT, and offline applications.
Total Feedback: 5
Last Updated: Dec 2
Key Features
Hardware · AI/ML · Software
Tech Stack
LLM · NVIDIA Jetson Orin Nano · ONNX Runtime / TensorRT · Python · +5 more
Image Gallery

EdgeSense - Chat Window

EdgeSense - Available and Download Models

EdgeSense - System Performance

EdgeSense - Setting Page

EdgeSense - API Server

EdgeSense - MCP Server

v0.2.0 Working resources and selectable models

v0.1.0 Ollama front end with markdown rendering

v0.1.0 Ollama front end working

TensorFlow and Ollama install complete

NVIDIA Jetson Orin Nano
Project Roadmap
Upcoming Features
As a user, I want to be able to securely connect to the Orin Nano via SSH and RDC to facilitate remote installation, testing, and debugging. This will affect the system's network security configurations and require updates to the deployment script (`deploy_and_run.sh`) to automate the setup of these services.
Planned
Medium Priority
As a user, I want to be able to enhance my research by using a new search function that integrates with the existing AI model. This feature will allow me to input queries that the AI can use to retrieve and provide additional contextual data from external sources, improving the depth and relevance of the responses I receive. This will affect the main application interface, specifically by introducing a new search bar component on the homepage and displaying search results in a designated section alongside the existing model outputs.
Planned
Medium Priority
As a user, I want to be able to add, view, and edit textual documentation for each model because it will provide additional context and insights, making model management more informative and efficient. This will affect the model management interface, specifically introducing a new component for text input and display within the model entries.
Planned
Medium Priority
New button on home page.
Completed
High Priority
Known Issues
No known issues
Documents & Files
EdgeSense v0.1 macOS Application
Install Document - ReadMe
Project Challenges
Challenge 1: Real-time System Monitoring
The first challenge was getting live CPU, GPU, and RAM data from the Jetson into the browser. The tegrastats utility provides this information, but it's a command-line tool.
Challenge 2: Managing Memory on a Constrained Device
The Jetson Orin Nano is powerful, but its 8 GB of RAM can be quickly consumed by larger language models. Early in testing, I found the system would become unresponsive or even freeze if I tried to load a model that was too large for the available memory.
Challenge 3: Smoothly Streaming Chat Responses
I wanted the chat to feel interactive, with the model's response appearing token by token, just like in popular applications. Ollama's API supports streaming, but handling this correctly on the frontend was tricky. Initial implementations resulted in garbled text or the entire response appearing at once.
Project Solutions & Learnings
Learning: I learned that for security reasons, a browser cannot directly execute local commands. The solution was to create the Python Flask "stats helper" API. This reinforced the principle of using a simple, dedicated microservice to bridge gaps between different parts of a system.
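A minimal sketch of that stats-helper idea is shown below, assuming a small Flask app running on the Jetson that launches tegrastats, parses the RAM figures from one line of its output, and returns them as JSON. The endpoint name, port, and regular expression are illustrative rather than the project's exact code, and tegrastats output varies slightly between JetPack versions.

```python
# stats_helper.py — illustrative sketch of a tegrastats-to-JSON bridge.
# Endpoint path, port, and parsing regex are assumptions, not the project's exact code.
import re
import subprocess

from flask import Flask, jsonify

app = Flask(__name__)

def read_tegrastats_line():
    """Run tegrastats briefly and return one line of its output."""
    proc = subprocess.Popen(
        ["tegrastats", "--interval", "1000"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        return proc.stdout.readline()
    finally:
        proc.terminate()

@app.route("/stats")
def stats():
    line = read_tegrastats_line()
    # Typical tegrastats output includes a field like "RAM 3162/7620MB";
    # the exact format depends on the JetPack version.
    match = re.search(r"RAM (\d+)/(\d+)MB", line)
    used_mb, total_mb = (int(match.group(1)), int(match.group(2))) if match else (None, None)
    return jsonify({"ram_used_mb": used_mb, "ram_total_mb": total_mb, "raw": line.strip()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5005)
```

The browser never touches tegrastats directly; it simply polls this endpoint, which is the microservice-as-bridge pattern described above.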
Learning: I gained a much deeper understanding of how to work with streaming APIs in React. It required careful state management to append new chunks of data to the existing message and re-render the component efficiently, creating the smooth, "typing" effect I was aiming for.
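To make the chunk handling concrete, here is a rough sketch of the same streaming logic written in Python rather than React. Ollama's /api/generate endpoint returns newline-delimited JSON objects, and each chunk's "response" field is appended to the message as it arrives; the model name and host below are placeholders.

```python
# Illustrative sketch of consuming Ollama's streaming /api/generate endpoint.
# The project does this in the React frontend; model name and host are placeholders.
import json

import requests

def stream_chat(prompt, model="llama3.2", host="http://localhost:11434"):
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(f"{host}/api/generate", json=payload, stream=True) as resp:
        resp.raise_for_status()
        message = ""
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            # Each chunk carries a small piece of the response; appending it
            # as it arrives produces the token-by-token "typing" effect.
            token = chunk.get("response", "")
            message += token
            print(token, end="", flush=True)
            if chunk.get("done"):
                break
        return message
```

In the React version the same append-per-chunk step drives a state update, which is where the careful state management mentioned above comes in.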
Learning: Resource management is paramount on edge devices. This led to the implementation of the "RAM guard-rail" feature. Before allowing a user to chat, the frontend fetches the model's size from the Ollama API and checks it against the free system RAM reported by the stats helper. If there isn't enough memory (with a safety margin), the chat input is disabled. This simple check dramatically improved the stability and user experience of the application.
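The guard-rail logic itself is simple, sketched below in Python (the project implements it in the React frontend). It assumes the /stats endpoint from the earlier stats-helper sketch and reads each model's size from Ollama's /api/tags listing; the safety margin is an arbitrary example value, not the project's exact figure.

```python
# Illustrative RAM guard-rail check: compare a model's size against free RAM
# before enabling chat. Endpoint paths and the safety margin are assumptions.
import requests

STATS_URL = "http://localhost:5005/stats"        # stats helper from the sketch above
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"
SAFETY_MARGIN_MB = 1024                          # example headroom, not the project's exact value

def can_load_model(model_name):
    stats = requests.get(STATS_URL).json()
    free_mb = stats["ram_total_mb"] - stats["ram_used_mb"]

    models = requests.get(OLLAMA_TAGS_URL).json()["models"]
    model = next(m for m in models if m["name"] == model_name)
    model_mb = model["size"] / (1024 * 1024)     # /api/tags reports size in bytes

    return model_mb + SAFETY_MARGIN_MB <= free_mb
```

If the check fails, the UI disables the chat input instead of attempting a load that would exhaust memory.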