It runs on my machine 04

Previously

These posts document my attempt at replicating the functionality of the llama.vim vim/nvim plugin in the context of my IDE of choice: PhpStorm and, by extension, other IntelliJ IDEs.
The name of the plugin I'm developing is "completamente": "completely" in English, and a word-play that in Italian breaks down into "complete" and "mind". So, something about intelligence and completion.
Naming things is hard.

In the first post, I managed to get a basic "Hello World!" inline completion working in the IDE. The completion was hardcoded, but the foundation was there - I could tap into the IDE's completion API and show suggestions to the user.

In the second post, I connected the plugin to a real llama.cpp server, implementing settings for endpoint configuration, making HTTP requests to the infill endpoint, and adding error handling with user notifications. The plugin started providing actual LLM-powered completions. But the implementation was messy and disorganized.

In the third post, I cleaned up that mess. I extracted the HTTP logic into a dedicated InfillHttpClient class, created proper response types with sealed classes, and wrote tests using an HTTP test server.

It's time to delve into the completion logic and the request itself, to try and replicate some of the unbelievably good suggestion mechanics I get while using the llama.vim plugin.

The problem with the current approach

Right now, when the user triggers a completion, the plugin does this:

val prefix = text.take(offset)
val suffix = text.substring(offset)

That's it. It sends the entire file before the cursor as the prefix, and the entire file after the cursor as the suffix.

This approach has two main issues:

  1. size - LLMs have limited context windows. Sending entire files risks overflowing the context window. Depending on the model and request settings, the context could be cut to fit in a manner that is beyond the user's control.
  2. performance - more tokens means more time to process, a critical factor in the quality of life of a completion plugin. Getting the best result with as few tokens as possible would be ideal.

The first remedy is to let the user (i.e. me) set a context window of lines around the cursor position. The idea of using lines is not mine, it comes from the llama.vim plugin I'm using as my implementation guide.

Putting the window logic into place

I added two new settings to the Settings.State class: nPrefix (default 256) and nSuffix (default 64), representing the number of lines to include before and after the cursor. I then updated the settings UI to expose these values with text fields and a description label.
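
For reference, the new fields look roughly like this; a sketch, since the real Settings.State also carries the endpoint configuration from the previous posts:

// Sketch of the two new window settings (the actual class has more fields)
data class State(
    var nPrefix: Int = 256, // number of lines to include before the cursor
    var nSuffix: Int = 64,  // number of lines to include after the cursor
)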

The interesting part is how to use these settings. The challenge is that the IntelliJ API gives me a character offset (the cursor position as a count of characters from the start of the file), but I need to work with line numbers to implement the window logic.

Here's what the code looks like in the Service class:

override suspend fun getSuggestion(request: InlineCompletionRequest): InlineCompletionSuggestion {
    val document = request.document
    val offset = request.startOffset
    val settings = Settings.getInstance()
    val state = settings.state

    // Handle empty documents
    if (document.lineCount == 0 || document.textLength == 0) {
        return StringSuggestion("")
    }

    // Get the current line number from the offset
    val currentLine = document.getLineNumber(offset)

    // Calculate the line range for prefix and suffix
    val prefixStartLine = maxOf(0, currentLine - state.nPrefix)
    val suffixEndLine = minOf(document.lineCount - 1, currentLine + state.nSuffix)

    // Get the start offset for the prefix window
    val prefixStartOffset = document.getLineStartOffset(prefixStartLine)

    // Get the end offset for the suffix window
    val suffixEndOffset = document.getLineEndOffset(suffixEndLine)

    // Extract prefix: from the start of the window to the cursor
    val prefix = document.text.substring(prefixStartOffset, offset)

    // Extract suffix: from the cursor to the end of the window
    val suffix = document.text.substring(offset, suffixEndOffset)

    // Get the completion from the LLM
    val completion = getCompletion(prefix, suffix)

    return StringSuggestion(completion)
}

The offset parameter is a character offset - the position of the cursor counting every character from the start of the file. If I have a PHP file that starts like this:

|<?php
function hello() {
    echo "world";
}

Then offset 0 is before the < (cursor position shown with |). Offset 3 would be:

<?p|hp
function hello() {
    echo "world";
}

And offset 6 would be at the start of the second line, after the newline:

<?php
|function hello() {
    echo "world";
}

The code converts this character offset to a line number using document.getLineNumber(offset), then calculates which lines to include based on nPrefix and nSuffix.

For example, with a TypeScript file like this where the cursor is at offset 118:

interface User {
    name: string;
    email: string;
}

function greet(user: User) {
    console.log(`Hello, ${user.n|ame}`);
}

If nPrefix is 256 and nSuffix is 64, the code would:

  1. Determine the cursor is on line 6 (counting from 0)
  2. Calculate prefixStartLine = max(0, 6 - 256) = 0
  3. Calculate suffixEndLine = min(7, 6 + 64) = 7 (the document only has 8 lines)
  4. Get the character offset for the start of line 0
  5. Get the character offset for the end of line 7
  6. Extract prefix from the start of line 0 to offset 118
  7. Extract suffix from offset 118 to the end of line 7

The empty document check at the beginning is there because the IntelliJ API throws IndexOutOfBoundsException when you try to get line offsets from an empty document. I discovered this when the tests failed with "Wrong line: -1. Available lines count: 0". When document.lineCount is 0, the expression document.lineCount - 1 evaluates to -1, which is invalid.

What does llama.vim actually send?

Before moving forward, I wanted to verify that the request sent by my implementation matches the one sent by the llama.vim plugin.
I started the llama.cpp server in verbose mode:

llama-server --fim-qwen-3b-default --port 8012 --verbose

Then I opened the TypeScript example file in nvim, positioned the cursor after the "n" in user.n|ame, and triggered a completion. The server logged this JSON payload (I've removed fields that aren't relevant for now):

{
  "input_suffix": "ame}`);\n}\n",
  "input_prefix": "interface User {\n    name: string;\n    email: string;\n}\n\nfunction greet(user: User) {\n",
  "prompt": "    console.log(`Hello, ${user.n"
}

Three fields:

  • input_prefix: Everything before the cursor
  • input_suffix: Everything after the cursor
  • prompt: The current line up to the cursor position

My plugin sends only input_prefix and input_suffix. It's missing the prompt field.

The llama.cpp server documentation mentions the prompt field and provides this explanation:

prompt: Added after the FIM_MID token

With reference to the Qwen 2.5 Coder paper, the request should follow this fill-in-the-middle structure:

<|fim_prefix|>{code_pre}<|fim_suffix|>{code_suf}<|fim_middle|>{code_mid}<|endoftext|>

The prompt will map to the {code_mid} section of the template.
At this stage, I trust the implementation of the llama.vim plugin to be the best one; I will tinker with this in later iterations.
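
For my own understanding, here's a rough sketch of how the three request fields would be stitched into that template on the server side, using Qwen's FIM tokens. This is only an illustration of the mapping; the actual assembly happens inside llama.cpp and also accounts for input_extra and token budgets.

// Illustration only: how input_prefix, input_suffix and prompt map onto
// the Qwen fill-in-the-middle template shown above.
fun illustrateFimAssembly(inputPrefix: String, inputSuffix: String, prompt: String): String =
    "<|fim_prefix|>" + inputPrefix +
        "<|fim_suffix|>" + inputSuffix +
        "<|fim_middle|>" + prompt // the model continues generating from here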

Adding the prompt field

In the getSuggestion method, I need to extract the current line up to the cursor:

// Extract prompt: the current line up to the cursor
val currentLineStartOffset = document.getLineStartOffset(currentLine)
val prompt = document.text.substring(currentLineStartOffset, offset)

Then update the getCompletion method to accept the prompt parameter:

private suspend fun getCompletion(prefix: String, suffix: String, prompt: String): String {

And include it in the request body:

val requestBody = JSONObject()
requestBody.put("input_prefix", prefix)
requestBody.put("input_suffix", suffix)
requestBody.put("prompt", prompt)

I added a test using the TypeScript example file, positioning the cursor at offset 118 (after user.n on line 6). But this time, instead of making real HTTP requests, I updated the tests to use the TestHttpServer mock that I had created for testing the HTTP client.

The test now captures the actual request payload and verifies that all three fields are sent correctly:

var capturedRequestBody: String? = null

// Respond with a completion content of "ame" to the request.
server.captureRequestAndRespond(200, """{"content": "ame"}""") { _, _, body ->
    capturedRequestBody = body
}

val request = makeInlineCompletionRequest("example.ts", 118, 118)
val service = Service()

val suggestion: InlineCompletionSuggestion = runBlocking { service.getSuggestion(request) }

// Verify the request payload using triple-quoted strings
val requestJson = JSONObject(capturedRequestBody!!)

val expectedPrefix = """interface User {
    name: string;
    email: string;
}

function greet(user: User) {
    console.log(`Hello, ${'$'}{user.n"""

val expectedSuffix = """ame}`);
}
"""

val expectedPrompt = "    console.log(`Hello, \${user.n"

assertEquals(expectedPrefix, requestJson.getString("input_prefix"))
assertEquals(expectedSuffix, requestJson.getString("input_suffix"))
assertEquals(expectedPrompt, requestJson.getString("prompt"))

Matching the complete request format

After verifying the prompt field was being sent correctly, I wanted to see what else the llama.vim plugin includes in its requests. Here's the complete request I had captured from llama.vim earlier:

{
  "input_suffix": "ame}`);\n}\n",
  "input_prefix": "interface User {\n    name: string;\n    email: string;\n}\n\nfunction greet(user: User) {\n",
  "top_p": 0.99,
  "input_extra": [],
  "t_max_prompt_ms": 500,
  "samplers": ["top_k", "top_p", "infill"],
  "n_predict": 128,
  "n_indent": 4,
  "t_max_predict_ms": 3000,
  "stream": false,
  "top_k": 40,
  "prompt": "    console.log(`Hello, ${user.n",
  "cache_prompt": true
}

My plugin was only sending three fields.
The rest of the fields control sampling behavior, performance limits, and other aspects of the completion request. I decided to add the most important ones as configurable settings:

  • tMaxPromptMs (default 500): Max allotted time for prompt processing
  • tMaxPredictMs (default 1000): Max allotted time for prediction
  • nPredict (default 128): Max number of tokens to predict

I added these to the Settings.State class and updated the UI to include them, each with a description. The rest of the fields (top_p, top_k, samplers, n_indent, stream, cache_prompt, input_extra) I've hard-coded to match llama.vim's values.

I will likely revisit this decision in the future, but it's good enough for now.
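
The new state fields follow the same pattern as the window settings. A sketch of the additions:

// Sketch of the new fields added to Settings.State
var tMaxPromptMs: Int = 500    // max time allotted to prompt processing (ms)
var tMaxPredictMs: Int = 1000  // max time allotted to prediction (ms)
var nPredict: Int = 128        // max number of tokens to predict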

The request building code now looks like this:

val requestBody = JSONObject()
requestBody.put("input_prefix", prefix)
requestBody.put("input_suffix", suffix)
requestBody.put("prompt", prompt)
requestBody.put("input_extra", emptyList<String>())
requestBody.put("top_p", 0.99)
requestBody.put("top_k", 40)
requestBody.put("t_max_prompt_ms", state.tMaxPromptMs)
requestBody.put("t_max_predict_ms", state.tMaxPredictMs)
requestBody.put("n_predict", state.nPredict)
requestBody.put("n_indent", 4)
requestBody.put("stream", false)
requestBody.put("cache_prompt", true)
requestBody.put("samplers", listOf("top_k", "top_p", "infill"))

I updated the tests to verify all the request parameters are being sent correctly. The testGetSuggestionWithTypeScriptFile test now checks not just the three main fields, but also all the additional parameters:

// Verify the additional request parameters
assertTrue(requestJson.has("input_extra"))
assertEquals(0, requestJson.getJSONArray("input_extra").length())

assertTrue(requestJson.has("top_p"))
assertEquals(0.99, requestJson.getDouble("top_p"), 0.001)

assertTrue(requestJson.has("top_k"))
assertEquals(40, requestJson.getInt("top_k"))

assertTrue(requestJson.has("t_max_prompt_ms"))
assertEquals(500, requestJson.getInt("t_max_prompt_ms"))

// ... and so on for all fields

I also added a test to verify that changing settings affects the request parameters:

fun testSettingsAffectRequestParameters() {
    val settings = Settings.getInstance()
    settings.state.tMaxPromptMs = 1000
    settings.state.tMaxPredictMs = 2000
    settings.state.nPredict = 256

    // ... make request and capture body ...

    val requestJson = JSONObject(capturedRequestBody!!)
    assertEquals(1000, requestJson.getInt("t_max_prompt_ms"))
    assertEquals(2000, requestJson.getInt("t_max_predict_ms"))
    assertEquals(256, requestJson.getInt("n_predict"))
}

The test setup now saves and restores all settings using state.copy() in setUp() and loadState() in tearDown(), ensuring tests don't affect each other.
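
Roughly, the setup and teardown look like this; a sketch assuming State is a data class, so copy() returns an independent snapshot:

private lateinit var savedState: Settings.State

override fun setUp() {
    super.setUp()
    // Take a snapshot of the settings before each test
    savedState = Settings.getInstance().state.copy()
}

override fun tearDown() {
    // Restore the snapshot so tests don't leak settings into each other
    Settings.getInstance().loadState(savedState)
    super.tearDown()
}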

After these updates, ./gradlew build succeeded with all tests passing.

Next

While there are a number of features missing to approach parity with the llama.vim plugin, in my next post I will concentrate on request optimization: debouncing requests, cancelling in-flight ones, and avoiding them entirely in some instances.