It runs on my machine 07

Last time I added debouncing and request avoidance to the plugin. It's not firing a request on every keystroke anymore, but I've been noticing something odd when using it for real work.

The completions work, technically. The LLM responds, the plugin receives the response, and it shows up in the editor. But the formatting is often off. Extra newlines where they shouldn't be. Weird spacing. Sometimes the completion just stops mid-word.

I've been treating the response from llama.cpp as a simple string: just grab the content field from the JSON and show it. That worked for basic "hello world" testing, but now that I'm using it for real code, the cracks are showing.

Of course, the llama.vim plugin's handling of the response is far more refined than my current naïve approach.

How llama.vim processes responses

Looking at the llama.vim code, I found the s:fim_render function where the completion response is processed.
Here's the relevant part:

" get the generated suggestion
if l:can_accept
    let l:response = json_decode(l:raw)

    for l:part in split(get(l:response, 'content', ''), "\n", 1)
        call add(l:content, l:part)
    endfor

    " remove trailing new lines
    while len(l:content) > 0 && l:content[-1] == ""
        call remove(l:content, -1)
    endwhile

The third parameter 1 in split() means "keep empty strings". So if the content is "foo\n\nbar", it becomes ["foo", "", "bar"] instead of ["foo", "bar"].

So llama.vim splits on newlines (keeping empty lines), then removes trailing empty lines. I'm just taking the raw string and showing it directly.

Implementing the parsing logic

I added a parseContent function to handle this in Service.kt:

/**
 * Parse the content from the LLM response.
 * Splits on newlines and removes trailing empty lines, matching llama.vim behavior.
 */
private fun parseContent(rawContent: String): String {
    if (rawContent.isEmpty()) {
        return ""
    }

    // Split on newlines, keeping empty strings (matching llama.vim's split with keepempty=1)
    val lines = rawContent.split("\n")

    // Remove trailing empty lines
    val trimmedLines = lines.dropLastWhile { it.isEmpty() }

    return trimmedLines.joinToString("\n")
}

Kotlin's split() already keeps empty strings by default, so that part was simpler than Vim. The dropLastWhile { it.isEmpty() } removes trailing empty lines, matching llama.vim's while loop.
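
To convince myself, here's a quick check of that behavior (just a scratch snippet, not part of the plugin):

fun main() {
    val raw = "foo\n\nbar\n\n\n"

    // Kotlin's split keeps the empty strings produced by consecutive newlines
    val lines = raw.split("\n")                        // [foo, , bar, , , ]

    // Drop only the trailing empty lines, keep the one in the middle
    val trimmed = lines.dropLastWhile { it.isEmpty() } // [foo, , bar]

    println(trimmed.joinToString("\n"))                // foo\n\nbar
}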

Then I updated the response handling to use this function:

is HttpResponse.Success -> {
    val jsonResponse = JSONObject(response.body)
    val rawContent = jsonResponse.optString("content", "")
    logger.info("Got completion: $rawContent")

    // Parse the content: split on newlines and remove trailing empty lines
    parseContent(rawContent)
}

Testing the parsing

I wrote four tests to cover the different edge cases:

fun testContentParsingRemovesTrailingNewlines() {
    // Configure server to return content with trailing newlines
    server.respondWith(200, """{"content": "function hello() {\n    return 'world';\n}\n\n\n"}""")

    val request = makeInlineCompletionRequest("text-file.txt", 0, 0)
    val service = Service()

    val suggestion: InlineCompletionSuggestion = runBlocking { service.getSuggestion(request) }
    val variants: List<InlineCompletionVariant> = runBlocking { suggestion.getVariants() }

    // Should remove trailing newlines
    assertEquals("function hello() {\n    return 'world';\n}", runBlocking { variants.first().elements.first().text })
}

The other tests cover: preserving empty lines in the middle, handling empty strings, and handling content that's only newlines.
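
The middle-empty-lines case, for example, looks roughly like this (a sketch along the same lines as the test above, not the exact code from my suite):

fun testContentParsingKeepsEmptyLinesInMiddle() {
    // Content with an empty line between statements, plus trailing newlines
    server.respondWith(200, """{"content": "val a = 1\n\nval b = 2\n\n"}""")

    val request = makeInlineCompletionRequest("text-file.txt", 0, 0)
    val service = Service()

    val suggestion: InlineCompletionSuggestion = runBlocking { service.getSuggestion(request) }
    val variants: List<InlineCompletionVariant> = runBlocking { suggestion.getVariants() }

    // The empty line in the middle is preserved; the trailing ones are removed
    assertEquals("val a = 1\n\nval b = 2", runBlocking { variants.first().elements.first().text })
}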

Once all tests pass, I can move to the next part of the work.

How llama.vim caches completions

With the parsing fixed, I'm curious about the caching system I saw mentioned earlier. My plugin fires a lot of requests, even with debouncing. Finding out how the llama.vim plugin avoids re-requesting completions for the same context, and porting that logic over to my plugin, would be a big win.

llama.vim uses two global variables for caching:

let g:cache_data = {}
let g:cache_lru_order = []

cache_data is a dictionary mapping hash keys to raw JSON responses. cache_lru_order is an array tracking which keys were used most recently, for eviction when the cache fills up. The cache has a configurable size limit (max_cache_keys, default 250). When it's full, the least recently used entry gets evicted.

The interesting part is how llama.vim generates cache keys. In the llama#fim function it computes multiple hashes for the same request:

let l:hashes = []

call add(l:hashes, sha256(l:prefix . l:middle . 'Î' . l:suffix))

let l:prefix_trim = l:prefix
for i in range(3)
    let l:prefix_trim = substitute(l:prefix_trim, '^[^\n]*\n', '', '')
    if empty(l:prefix_trim)
        break
    endif

    call add(l:hashes, sha256(l:prefix_trim . l:middle . 'Î' . l:suffix))
endfor

The primary hash is sha256(prefix + middle + 'Î' + suffix). The 'Î' character is just a separator. But then it creates up to 3 additional hashes by progressively removing the first line from the prefix. The comment explains why: "this happens when we have scrolled down a bit from where the original generation was done". So if I get a completion at line 100, then scroll down to line 103, the plugin can still use that cached completion by trimming the first 3 lines.

Before making an HTTP request, llama.vim checks if any of the computed hashes are already cached:

if a:use_cache
    for l:hash in l:hashes
        if s:cache_get(l:hash) != v:null
            return
        endif
    endfor
endif

If there's a cache hit, it returns immediately without making a request. When a response comes back from the server, it's stored in the cache with all the computed hashes:

" put the response in the cache
for l:hash in a:hashes
    call s:cache_insert(l:hash, l:raw)
endfor

So one response gets stored under multiple keys. This increases cache hit rates when scrolling through code.

The most interesting code is in s:fim_try_hint. If there's no exact cache hit, it searches backwards through the last 128 characters to find a "nearby" cached completion:

for i in range(128)
    let l:removed = l:pm[-(1 + i):]
    let l:ctx_new = l:pm[:-(2 + i)] . 'Î' . l:suffix

    let l:hash_new = sha256(l:ctx_new)
    let l:response_cached = s:cache_get(l:hash_new)
    if l:response_cached != v:null
        if l:response_cached == ""
            continue
        endif

        let l:response = json_decode(l:response_cached)
        if l:response['content'][0:i] !=# l:removed
            continue
        endif

        let l:response['content'] = l:response['content'][i + 1:]
        if len(l:response['content']) > 0
            " ... use this cached response
        endif
    endif
endfor

This loops through positions 1-128 characters back. For each position, it generates the hash for that earlier context, checks if there's a cached response, verifies that the first i characters of the cached response match what was typed since then, trims those characters off, and uses the remaining text as the completion.

Example: Say I'm at "function hello" and the cache has a completion for "function h" that returns "ello() { ... }". When I check position 4 back, I find that cached response, verify that "ello" matches what I typed, trim it off, and show "() { ... }" as the completion. This is very efficient (and it's what really surprised me when I used llama.vim): I get instant completions while typing, even though the LLM hasn't seen my latest keystrokes yet.

Implementing the cache

I created a ResponseCache class to handle the caching logic:

class ResponseCache(private val maxSize: Int = 250) {
    private val cacheData = mutableMapOf<String, String>()
    private val lruOrder = mutableListOf<String>()

    fun generateHash(prefix: String, middle: String, suffix: String): String {
        val combined = prefix + middle + "Î" + suffix
        val digest = MessageDigest.getInstance("SHA-256")
        val hashBytes = digest.digest(combined.toByteArray())
        return hashBytes.joinToString("") { "%02x".format(it) }
    }

    fun generateHashes(prefix: String, middle: String, suffix: String): List<String> {
        val hashes = mutableListOf<String>()

        // Primary hash
        hashes.add(generateHash(prefix, middle, suffix))

        // Additional hashes with trimmed prefix (up to 3 lines removed)
        var prefixTrim = prefix
        for (i in 0 until 3) {
            val newlineIndex = prefixTrim.indexOf('\n')
            if (newlineIndex == -1) break

            prefixTrim = prefixTrim.substring(newlineIndex + 1)
            if (prefixTrim.isEmpty()) break

            hashes.add(generateHash(prefixTrim, middle, suffix))
        }

        return hashes
    }
}
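
The get and put methods of ResponseCache aren't shown above. Roughly, they look like this, a sketch of the LRU behavior described earlier: a hit moves the key to the most-recently-used end, and inserting into a full cache evicts the least recently used entry first:

fun get(key: String): String? {
    val content = cacheData[key] ?: return null
    // Cache hit: mark this key as most recently used
    lruOrder.remove(key)
    lruOrder.add(key)
    return content
}

fun put(key: String, content: String) {
    // Evict the least recently used key when inserting into a full cache
    if (key !in cacheData && cacheData.size >= maxSize) {
        val evicted = lruOrder.removeAt(0)
        cacheData.remove(evicted)
    }
    cacheData[key] = content
    // A new or updated key becomes the most recently used
    lruOrder.remove(key)
    lruOrder.add(key)
}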

I realized something important while implementing this. The llama.vim plugin stores full JSON responses in its cache because it needs to manipulate the JSON structure: it decodes cached responses, modifies the content field, and re-encodes them as JSON for rendering purposes (metrics, timings, and so on). This makes sense for Vim's rendering system.

In my IntelliJ plugin, though, I'm not doing any of that JSON manipulation. When I find a cache hit (exact or partial), I work directly with the content string. I never need to re-encode anything as JSON.

This means I can store only the parsed content string, not the full JSON response. When I cache a completion, I extract the content field from the JSON response, parse it (remove trailing newlines), and store only that final string. This is more memory-efficient and simpler: no need to store and parse JSON on every cache hit.

In Service.getSuggestion(), I check the cache before making a request:

// Check cache for exact match (including trimmed prefix variations)
val hashes = cache.generateHashes(prefix, middle, suffix)
for (hash in hashes) {
    val cachedContent = cache.get(hash)
    if (cachedContent != null) {
        logger.info("Cache hit for hash: $hash")
        return StringSuggestion(cachedContent)
    }
}

And when a response comes back from the server, I store it with all the generated hashes:

is HttpResponse.Success -> {
    // Parse the JSON response and extract the content
    val jsonResponse = JSONObject(response.body)
    val rawContent = jsonResponse.optString("content", "")
    logger.info("Got completion: $rawContent")

    // Parse the content: split on newlines and remove trailing empty lines
    val parsedContent = parseContent(rawContent)

    // Store only the parsed content string in cache with all hashes
    for (hash in hashes) {
        cache.put(hash, parsedContent)
    }

    parsedContent
}

I wrote three tests to verify the caching works:

/**
 * This test verifies that the cache prevents duplicate HTTP requests
 * when the same context is requested twice.
 */
fun testCachingAvoidsSecondRequest() {
    var requestCount = 0

    server.captureRequestAndRespond(200, """{"content": "cached completion"}""") { _, _, _ ->
        requestCount++
    }

    val request = makeInlineCompletionRequest("text-file.txt", 0, 0)
    val service = Service()

    // First request - should hit the server
    runBlocking { service.getSuggestion(request) }
    assertEquals(1, requestCount)

    // Second request with same context - should use cache
    runBlocking { service.getSuggestion(request) }
    assertEquals(1, requestCount) // Still 1
}

Implementing partial cache hits

With exact caching working, there's still one more optimization from llama.vim I want to implement: partial cache hits.

The idea is simple but clever. And not mine!
Say I'm at "function h" and the LLM returns "ello() { }". I accept the suggestion and keep typing.
Now I'm at "function hello": instead of making a new request, I can look back in my recent keystrokes, find that I have a cached response for "function h" that starts with "ello", trim off the "ello" I already typed, and show "() { }" as the completion.

Looking at llama.vim's s:fim_try_hint function, it searches backwards through the last 128 characters.
For each position, it checks if there's a cached response for that earlier context and whether the first N characters of that response match what was typed since then. If so, it trims those characters off and uses the rest.
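
Ported to Kotlin, the core of that logic looks roughly like this (a simplified sketch of my findPartialCacheHit, assuming the earlier context is hashed the same way as the primary hash above):

private fun findPartialCacheHit(prefix: String, middle: String, suffix: String): String? {
    val typed = prefix + middle

    // Look back through at most the last 128 characters before the cursor
    val maxLookback = minOf(128, typed.length)
    for (i in 1..maxLookback) {
        // The characters typed since the earlier, possibly cached, context
        val removed = typed.takeLast(i)
        val earlierContext = typed.dropLast(i)

        // Hash the earlier context the same way the primary hash is computed
        val hash = cache.generateHash(earlierContext, "", suffix)
        val cachedContent = cache.get(hash) ?: continue
        if (cachedContent.isEmpty()) continue

        // The cached completion must start with what was typed in the meantime
        if (!cachedContent.startsWith(removed)) continue

        // Trim off what was already typed and use the rest as the completion
        val remaining = cachedContent.substring(i)
        if (remaining.isNotEmpty()) {
            return remaining
        }
    }
    return null
}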

Once implemented in Service.kt, findPartialCacheHit is used like this:

// Check for partial cache hit by searching backwards through recent keystrokes
val partialHit = findPartialCacheHit(prefix, middle, suffix)
if (partialHit != null) {
    logger.info("Partial cache hit found")
    return StringSuggestion(partialHit)
}

I added this right after the exact cache check fails and before making an HTTP request. This way the plugin tries exact matches first (fastest), then partial matches (fast), then falls back to the LLM (slow).

Testing partial cache hits

I updated the tests to properly simulate the typing scenario and test the partial cache hit logic:

fun testPartialCacheHit() {
    var requestCount = 0
    val service = Service()

    // First request at "function h" - this will be cached
    myFixture.configureByText("test.txt", "function h")
    myFixture.editor.caretModel.moveToOffset(10)

    server.captureRequestAndRespond(200, """{"content": "ello() { }"}""") { _, _, _ ->
        requestCount++
    }

    val request1 = makeInlineCompletionRequest("test.txt", 10, 10)
    val suggestion1 = runBlocking { service.getSuggestion(request1) }
    val variants1 = runBlocking { suggestion1.getVariants() }
    assertEquals("ello() { }", runBlocking { variants1.first().elements.first().text })
    assertEquals(1, requestCount)

    // Now simulate typing "ello" by reconfiguring with the full text
    myFixture.configureByText("test.txt", "function hello")
    myFixture.editor.caretModel.moveToOffset(14)

    // This should NOT make a new HTTP request - it should use partial cache hit
    val request2 = makeInlineCompletionRequest("test.txt", 14, 14)
    val suggestion2 = runBlocking { service.getSuggestion(request2) }
    val variants2 = runBlocking { suggestion2.getVariants() }

    // Should return "() { }" (the cached "ello() { }" with "ello" trimmed off)
    assertEquals("() { }", runBlocking { variants2.first().elements.first().text })
    assertEquals(1, requestCount) // Still only 1 request - proves partial hit worked
}

I'm using configureByText() to set up the complete document state for each request, rather than trying to simulate typing with type(), for the simple reason that I could not make it work reliably.
The test verifies that only one HTTP request is made, proving the second completion came from the partial cache hit.

What's next

The plugin now has all the core features: debouncing, request avoidance, response parsing, full caching with LRU eviction, and partial cache hits. It's starting to feel responsive when coding.

Next time I'll tackle the extra input, the next bright idea from the llama.vim plugin, which provides additional context for the completions.