Posted on May 20, 2025 by c0psn

Building a Sentiment-Aware RAG Pipeline in .NET with PDF and URL Ingestion

Introduction

In this blog post, we’ll walk through building a full-stack prototype that combines Retrieval-Augmented Generation (RAG), keyword-level sentiment analysis, and OpenAI-generated summaries using C#, Blazor Server, and Qdrant. The system ingests text from both URLs and PDF files, analyzes it, and lets users query a semantic vector store in real time.

Overview of the Workflow

This system provides:

Dual ingestion sources: users can upload PDFs or specify URLs.
Tokenization and embedding of page or chunk-level text.
Keyword extraction and sentiment analysis, both page-wide and per keyword.
Vector storage in Qdrant for semantic search.
Keyword-level sentiment summaries using OpenAI’s Chat API.
Blazor Server UI with charting for visual results.
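
To make the keyword steps above a bit more concrete, here is a minimal sketch of counting keyword occurrences and cutting out the words around each hit so that sentiment can be scored per keyword. The names KeywordSketch, CountKeywords, ContextWindows, and the radius parameter are made up for illustration; the real KeywordExtractorService and KeywordContextSentimentService may work differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not the project's actual service code.
public static class KeywordSketch
{
    // Counts case-insensitive whole-word occurrences of each keyword.
    public static Dictionary<string, int> CountKeywords(string text, IEnumerable<string> keywords)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var keyword in keywords)
        {
            var pattern = @"\b" + Regex.Escape(keyword) + @"\b";
            counts[keyword] = Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
        }
        return counts;
    }

    // Returns one window per keyword hit: up to `radius` words on each side.
    public static List<string> ContextWindows(string text, string keyword, int radius = 10)
    {
        var words = Regex.Matches(text, @"\S+").Cast<Match>().Select(m => m.Value).ToArray();
        var windows = new List<string>();
        for (int i = 0; i < words.Length; i++)
        {
            // Strip common punctuation so "revenue," still matches "revenue".
            var bare = words[i].Trim('.', ',', ';', ':', '!', '?', '"', '\'', '(', ')');
            if (!string.Equals(bare, keyword, StringComparison.OrdinalIgnoreCase)) continue;

            int start = Math.Max(0, i - radius);
            int end = Math.Min(words.Length - 1, i + radius);
            windows.Add(string.Join(" ", words, start, end - start + 1));
        }
        return windows;
    }
}
```

Each window can then be passed to the sentiment model on its own, which is what makes the score keyword-local rather than page-wide.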

Key Technologies Used

ML.NET: For running a local ONNX model to score sentiment.
Microsoft.ML.Tokenizers: For BERT-compatible tokenization.
Qdrant: Lightweight vector store with cosine similarity.
OpenAI .NET SDK (2.1.0): For keyword sentiment summaries.
Blazorise + Chart.js: For UI and interactive charting.
PdfPig: For extracting text from PDFs.

Architecture Breakdown

Services

SentimentAnalyzerService: Uses ONNX + tokenizer to get sentiment logits.
KeywordExtractorService: Extracts keyword frequencies from raw text.
KeywordContextSentimentService: Runs sentiment on keyword-local context windows.
TextChunker: Splits long documents into manageable chunks.
EmbeddingService: Uses OpenAI Embeddings API to get vector representation.
VectorStoreService: Talks to Qdrant and supports search + upsert.
KeywordSentimentSummaryService: Builds averaged keyword sentiment map and uses GPT to summarize it.
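
As an example of what a service like TextChunker might do, here is a rough sketch of word-based chunking with a small overlap between neighboring chunks. The chunk size, the overlap, and the class name TextChunkerSketch are assumptions for illustration; the post does not say how the real chunker splits text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; the real TextChunker may split by tokens or sentences instead.
public static class TextChunkerSketch
{
    // Splits text into chunks of at most maxWords words, repeating the last
    // `overlap` words of each chunk at the start of the next one.
    public static List<string> Chunk(string text, int maxWords = 200, int overlap = 20)
    {
        if (maxWords <= 0) throw new ArgumentOutOfRangeException(nameof(maxWords));
        if (overlap < 0 || overlap >= maxWords) throw new ArgumentOutOfRangeException(nameof(overlap));

        var words = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        int step = maxWords - overlap;

        for (int start = 0; start < words.Length; start += step)
        {
            int count = Math.Min(maxWords, words.Length - start);
            chunks.Add(string.Join(" ", words, start, count));
            if (start + count >= words.Length) break; // last chunk reached the end
        }
        return chunks;
    }
}
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk, which tends to help retrieval.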

Controller Endpoints

api/rag/analyze: For URL ingestion and sentiment processing.
api/pdf/analyze: Accepts PDF uploads, runs chunk analysis, sentiment, and stores vectors.
api/rag/query: Accepts a question and performs similarity search against stored vectors.
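
To show what the similarity search behind api/rag/query boils down to, here is a small in-memory sketch of cosine-similarity ranking. In the real app Qdrant does this work on the server side; SimilaritySketch, Cosine, and TopK are illustrative names only, and the tiny vectors stand in for real embeddings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for the vector store's search.
public static class SimilaritySketch
{
    // Cosine similarity: dot product divided by the product of the vector lengths.
    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // a zero vector has no direction
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scores every stored chunk against the query vector and keeps the best k.
    public static List<(string Text, double Score)> TopK(
        float[] query, IEnumerable<(string Text, float[] Vector)> chunks, int k)
    {
        return chunks
            .Select(c => (Text: c.Text, Score: Cosine(query, c.Vector)))
            .OrderByDescending(r => r.Score)
            .Take(k)
            .ToList();
    }
}
```

The query text is embedded with the same model as the stored chunks, and the top-scoring chunks are what the Query component displays.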

Key Feature: Keyword Sentiment Summary

Once page-level and keyword-level sentiment scores are collected across all documents or PDFs, the system averages the keyword scores and sends them to GPT-4:

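// Requires: using OpenAI.Chat;
// Turn the averaged keyword scores into a plain-text prompt.
var prompt = BuildPromptFromAveragedSentiments(averaged);

// _openai is assumed to be the injected OpenAIClient.
var chatClient = _openai.GetChatClient("gpt-4");
var messages = new List<ChatMessage>
{
    ChatMessage.CreateSystemMessage("You are an analyst. Provide a short summary of the keyword-level sentiment results."),
    ChatMessage.CreateUserMessage(prompt)
};

// The call returns a ClientResult; the summary is the first content part of the completion.
var response = await chatClient.CompleteChatAsync(messages);
var summary = response.Value.Content[0].Text;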

This gives a human-readable synthesis of how the keywords are treated across your documents, which is useful for reporting or analysis.
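
The averaging step and the prompt builder are not shown in the snippet, so here is one way they could look. This is a guess, not the project's code: the tuple shape, the assumed score range of -1 to +1, and the prompt wording are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical sketch of the averaging and prompt-building steps.
public static class KeywordSummarySketch
{
    // Collapses per-document keyword scores into one average per keyword.
    public static Dictionary<string, double> Average(IEnumerable<(string Keyword, double Score)> scores)
    {
        return scores
            .GroupBy(s => s.Keyword, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Average(s => s.Score), StringComparer.OrdinalIgnoreCase);
    }

    // Renders the averages as a short bulleted list the chat model can summarize.
    // Assumes scores run from -1 (negative) to +1 (positive).
    public static string BuildPromptFromAveragedSentiments(IReadOnlyDictionary<string, double> averaged)
    {
        var sb = new StringBuilder();
        sb.AppendLine("Averaged keyword sentiment scores (-1 = negative, +1 = positive):");
        foreach (var pair in averaged.OrderBy(p => p.Key, StringComparer.OrdinalIgnoreCase))
        {
            sb.AppendLine("- " + pair.Key + ": " + pair.Value.ToString("F2", CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }
}
```

Sorting the keywords and formatting with the invariant culture keeps the prompt stable between runs, which makes the summaries easier to compare.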

UI Components

RAGAnalyzer.razor: Accepts URLs, performs analysis, and shows results.
UploadPdf.razor: Accepts PDFs, runs the same pipeline, and stores results.
Charts:
- PageSentimentChart
- KeywordSentimentChart
- KeywordChart
Query Component: Lets you search the vector DB by embedding your query and showing top chunks.

Example Use Case

Imagine you’re analyzing press releases, company filings, or product reviews. Paste in a few URLs or upload PDFs of documents, provide key terms like “revenue”, “climate”, or “safety”, and let the app:

Analyze how each keyword is viewed in each source.
Store those insights as vector chunks.
Summarize keyword sentiment.
Allow query access using natural language.

Sample Output:

Keyword Sentiment Summary:
- "profit": moderately positive
- "layoffs": highly negative
- "growth": slightly positive

Summary: Overall, the documents emphasize profitability with moderate optimism. Layoffs were discussed negatively, while growth projections are cautiously positive.

Lessons Learned

ONNX with ML.NET is a great choice for local sentiment without sending data to the cloud.
Qdrant is incredibly lightweight and fast for semantic similarity.
OpenAI summaries give clarity to otherwise raw logits and scores.
Blazor Server and Blazorise provide a robust UI pattern that feels modern and reactive.
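
On the point about raw logits: a common way to make them readable before any summarizing is to run them through softmax. The sketch below assumes a two-class model whose output is ordered [negative, positive]; that ordering, and the names LogitSketch, Softmax, and SignedScore, are assumptions, so check them against the ONNX model you actually load.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for turning model logits into something readable.
public static class LogitSketch
{
    // Numerically stable softmax: subtract the max logit before exponentiating.
    public static double[] Softmax(IReadOnlyList<float> logits)
    {
        if (logits.Count == 0) return Array.Empty<double>();
        double max = logits.Max();
        var exps = logits.Select(l => Math.Exp(l - max)).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }

    // Signed score in [-1, 1], assuming logits are ordered [negative, positive].
    public static double SignedScore(IReadOnlyList<float> logits)
    {
        if (logits.Count != 2) throw new ArgumentException("Expected exactly two logits.");
        var p = Softmax(logits);
        return p[1] - p[0];
    }
}
```

A signed score like this is easy to average per keyword and to plot, while the raw logits are not.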

Final Thoughts

This prototype illustrates what you get by combining local ML (via ONNX), semantic embeddings (via OpenAI), vector databases (via Qdrant), and interactive UI (via Blazor). It’s designed to be extensible, to follow SOLID principles, and to be production-quality.

With this setup, you can:

Scale to PDFs, HTML pages, or any document corpus.
Add per-document classifiers, labels, or redaction.
Evolve into a complete enterprise knowledge extraction tool.

Let me know if you’d like to extend this to:

LLM completion directly over chunks
In-browser text highlighting
Source citation in query answers
Scheduled crawl + ingestion of new data

Happy hacking!

Posted on June 14, 2018 by c0psn

Time Drift Among Systems

Here we compare 3 different clocks against NTP over a period of 30 days using the Blaze Stone Time API.

Posted on June 8, 2018 (updated June 14, 2018) by c0psn

Numerical Variance Among Processors and Operating Systems

Here we evaluate the results of the Polynomial Fit API for Windows (7), Linux (Ubuntu 14.04) and Raspbian.

Posted on March 8, 2015 (updated June 20, 2018) by c0psn

HARDWARE ENCRYPTION

This idea fully exploits the configurable logic aspect of the 74AUP2G57GM. It works similarly to AES encryption. Each configurable logic device would act as a tumbler for the logic device before it. The opens and shorts would determine the resulting logic, which would produce a unique code for that message. To encrypt further, a GPS time sync could be added to make the message more obscure. To decrypt, the device would simply need to be inverted, so a message could be uniquely encoded using a basic ASCII binary representation. The hardware device would be the key for the message, so something like this could be realized. There’s probably a patent lurking somewhere within this one… 🙂

USB -> NXP Encryption -> Computer

Posted on September 10, 2013 (updated June 20, 2018) by c0psn

Looking at the data

The complete schematic is here:

Schematic for RFID1 simulation

Input and output for RFID1. It may be desirable to increase the DC offset or the amplitude of the output, but I’m going to go with this signal to start and work on refining the output after I breadboard the design.

So there are a few ways to increase the strength of the output signal while still maintaining the necessary modulation of the RFID card. I added a small load on the output. This did reduce the overall time the signal was ‘high’, but it would likely not matter much to the microcontroller. I increased and decreased the overall inductance and changed the ratios of a few resistors, but mostly that led to instability in the output signal. Coilcraft has a few nice 125 kHz RFID inductors on their site that I’m going to add to the BOM and the board layout.

My next step is to work on the PCB layout, routing, stack-up, geometry, etc. and get the board made. Then I’m going to build up and test a breadboard while I wait for the real parts to come in. I’m not going to buy a microcontroller yet… I have an Arduino device in mind, but I’ll post another section on this hardware.

Posted on August 28, 2013 (updated June 20, 2018) by c0psn

building the RFID schematic

Well, I spent most of the night building the schematic for the RFID transmitter. Reverse engineering is always difficult at first, but it is important and useful to build from. Today was also cool because I was able to go through my Time Robot’s magazine. This definitely gave me a few ideas of how and what to build in the future. 🙂

I ran into a few issues tonight. My VMware machine stopped working. I had to go into the program folder and delete the files ending in .lck. After I did that I was free to work with the VM. There is some risk that the VM will not be recoverable, but I’m uploading all my files to the shared drive.

I also found a few datasheets to give an idea of what the circuit is doing and uploaded them as well. So after debugging here is my progress:

So after I complete the schematic, I’ll go through the analysis for the analog and digital signals and see if we can improve this design at all 🙂