It runs on my machine 06
November 13, 2025
Last time I got request cancellation working with Ktor. The plugin no longer hangs when I type quickly, which is good. But there's a more fundamental problem: the plugin is still firing a request on every single keystroke.
That's wasteful. If I type "function hello" quickly, I don't need a separate request for every keystroke - I need one request after I pause. And sometimes I don't need a request at all.
Time to add debouncing and request avoidance, following what llama.vim does.
Digging into llama.vim's logic
I opened up the llama.vim file to see how the original plugin handles this. There are two main strategies:
Debouncing: When a request is already in flight and a new one comes in, don't send it immediately.
Instead, start a 100ms timer. If another request comes in before the timer fires, restart the timer. This means requests only go out after a brief pause in typing.
Here's the relevant code from llama.vim (lines 555-563):
" avoid sending repeated requests too fast
if s:current_job != v:null
if s:timer_fim != -1
call timer_stop(s:timer_fim)
let s:timer_fim = -1
endif
let s:timer_fim = timer_start(100, {-> llama#fim(a:pos_x, a:pos_y, v:true, a:prev, a:use_cache)})
return
endif
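The same stop-and-restart timer pattern can be expressed in Kotlin coroutine terms. This is just an illustrative sketch (the class and names are mine, and it ignores llama.vim's extra condition that a request must already be in flight), but it shows the core idea: every keystroke cancels the pending timer, so the action only fires after a pause.

```kotlin
import kotlinx.coroutines.*

// Sketch of the timer-restart debounce in coroutine terms: each new submit()
// cancels the pending timer (timer_stop) and starts a fresh one (timer_start),
// so the action only runs after delayMs of quiet.
class Debouncer(private val scope: CoroutineScope, private val delayMs: Long = 100) {
    private var timer: Job? = null

    fun submit(action: suspend () -> Unit) {
        timer?.cancel()               // timer_stop equivalent
        timer = scope.launch {        // timer_start equivalent
            delay(delayMs)
            action()
        }
    }
}
```

I'll end up taking a simpler delay-based route in my own plugin, but the restartable-timer version is closer to what llama.vim literally does.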
Request avoidance: Don't send requests in certain situations where completions are unlikely to be useful. The main one is max_line_suffix - if the cursor has too much text to the right of it, skip the request. The default is 8 characters.
From llama.vim lines 579-581:
if a:is_auto && len(l:ctx_local['line_cur_suffix']) > g:llama_config.max_line_suffix
    return
endif
Makes sense. If I'm in the middle of a line with "function hello(name) {" and my cursor is after "hello", I probably don't want a completion suggestion. I'm editing existing code, not writing new code.
Let's start with the easier piece: adding the maxLineSuffix setting.
Using maxLineSuffix to skip requests
First I added a new setting for the maxLineSuffix value (I'm not showing the settings code anymore since it's boring and looks like 1995 table-based HTML).
Now I need to implement the actual logic that checks the line suffix length and skips the request if it's too long.
I added this check in Service.getSuggestion() right after calculating currentLine, before any of the expensive prefix/suffix extraction or HTTP request logic:
// Skip request if line suffix is too long
val currentLineEndOffset = document.getLineEndOffset(currentLine)
if (currentLineEndOffset - offset > state.maxLineSuffix) {
    return StringSuggestion("")
}
Simple calculation: the number of characters from the cursor to the end of the line. If it exceeds maxLineSuffix, return an empty suggestion without making a request.
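To make the arithmetic concrete, here's a tiny standalone version of the same check (the function name and sample values are mine, not the plugin's):

```kotlin
// Hypothetical standalone version of the suffix-length check.
fun shouldSkipRequest(lineText: String, cursorColumn: Int, maxLineSuffix: Int): Boolean {
    // Characters from the cursor to the end of the line.
    val suffixLength = lineText.length - cursorColumn
    return suffixLength > maxLineSuffix
}

fun main() {
    // Cursor right after "function": 14 characters remain, over the default of 8 -> skip.
    println(shouldSkipRequest("function hello(name) {", 8, 8))
    // Cursor at the end of the line: nothing to the right -> send the request.
    println(shouldSkipRequest("function hello(name) {", 22, 8))
}
```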
This broke one of the existing tests because it was positioned at the start of a line with 23 characters - way over the default limit of 8. I updated the test setup to set maxLineSuffix = 1000 so existing tests aren't affected by this new behavior.
Then I added a specific test to verify the logic works:
fun testMaxLineSuffixSkipsRequest() {
    // Set maxLineSuffix to 5 characters
    val settings = Settings.getInstance()
    settings.state.maxLineSuffix = 5

    // Configure server - but we expect NO request to be made
    var requestMade = false
    server.captureRequestAndRespond(200, """{"content": "should not see this"}""") { _, _, _ ->
        requestMade = true
    }

    // Position cursor at offset 0 of "text-file.txt".
    // The line "This is a test file." leaves far more than 5 characters
    // to the right of the cursor, so the request should be skipped.
    val request = makeInlineCompletionRequest("text-file.txt", 0, 0)
    val service = Service()
    runBlocking { service.getSuggestion(request) }

    // Should return an empty suggestion without making a request
    assertFalse(requestMade)
}
The key assertion is assertFalse(requestMade) - verifying the HTTP request is never sent when the line suffix is too long.
So now the plugin has basic request avoidance: if I'm editing in the middle of a line with too much text to the right, it won't bother asking the LLM for a completion. That should cut down on unnecessary requests.
Implementing debouncing
The bigger problem remains: the plugin still fires a request on every keystroke when conditions allow it. I need the debouncing logic: the 100ms delay that waits for a pause in typing.
Looking at llama.vim's debouncing code, the delay only happens when a request is already in flight:
if s:current_job != v:null
    if s:timer_fim != -1
        call timer_stop(s:timer_fim)
        let s:timer_fim = -1
    endif

    let s:timer_fim = timer_start(100, {-> llama#fim(...)})
    return
endif
In plain English:

If a request is already running, cancel any pending timer, start a fresh 100ms timer that will retry the completion, and return without sending anything now.
So the logic is: don't immediately cancel the in-flight request. Give it 100ms to finish. If I'm still typing after 100ms, then start a new request (which will cancel the old one).
But does llama.vim even have request cancellation? Looking further down in the code:
if s:current_job != v:null
    if s:ghost_text_nvim
        call jobstop(s:current_job)
    elseif s:ghost_text_vim
        call job_stop(s:current_job)
    endif
endif
Yes, it does. It cancels the previous job before sending a new request. So the 100ms delay is not a workaround for lack of cancellation: it's a strategy to avoid cancelling too aggressively. Fast requests get a chance to finish naturally instead of being cancelled immediately.
The complete strategy is:
- If request is in flight then wait 100ms (give current request a chance to finish).
- If still need to send request then cancel any leftover job, then send new request.
I already have the cancellation part via Ktor's coroutine cancellation. I just need to add the "wait if busy" logic.
First, I added a method to InfillHttpClient to check if a request is in flight:
fun isRequestInFlight(): Boolean {
    return currentJob?.isActive == true
}
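For reference, currentJob here is the kotlinx.coroutines Job the client stores whenever it launches a request. A stripped-down sketch of that bookkeeping (class and method names are mine; the real client wraps a Ktor call):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.launch

// Simplified sketch of tracking the in-flight request. The real InfillHttpClient
// launches a Ktor HTTP call; here the work is just a suspend lambda.
class InflightTracker(private val scope: CoroutineScope) {
    private var currentJob: Job? = null

    fun isRequestInFlight(): Boolean = currentJob?.isActive == true

    fun send(work: suspend () -> Unit) {
        currentJob?.cancel()                 // cancel any leftover request first
        currentJob = scope.launch { work() }
    }
}
```

With this in place, the completion path can ask isRequestInFlight() before deciding whether to wait.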
Then I added the debouncing check at the start of getCompletion():
private suspend fun getCompletion(prefix: String, suffix: String, prompt: String): String {
    return withContext(Dispatchers.IO) {
        try {
            // Debouncing: if a request is already in flight, wait 100ms.
            // This gives the current request time to finish before we cancel it.
            if (httpClient?.isRequestInFlight() == true) {
                delay(100)
            }

            // ... rest of existing logic ...
        }
        // ... existing catch blocks elided ...
    }
}
The plugin now matches llama.vim's debouncing strategy. This should significantly reduce the number of wasted requests when I'm typing quickly.
Handling cancellation gracefully
Testing the plugin in the IDE, I noticed an error notification about the request being cancelled.
The debouncing is working: requests are being cancelled when I type quickly. But the code is treating cancellation as an error and showing notifications. Not great.
The problem is in the error handling. When a coroutine is cancelled, it throws CancellationException. The catch block was catching it as a generic Exception:
} catch (e: Exception) {
    logger.warn("Failed to get completion", e)
    showErrorNotification("Failed to connect to LLM endpoint: ${e.message}")
    ""
}
Cancellation is not an error - it's normal during debouncing. I need to handle it separately.
I added a specific catch block for CancellationException before the generic handler:
} catch (e: CancellationException) {
    // Request was cancelled - this is normal during debouncing, return empty silently
    ""
} catch (e: Exception) {
    logger.warn("Failed to get completion", e)
    showErrorNotification("Failed to connect to LLM endpoint: ${e.message}")
    ""
}
Now when requests get cancelled (which happens frequently when typing), it just returns an empty suggestion silently. Only real connection or server errors show notifications.
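One detail worth calling out: in Kotlin, CancellationException is a subclass of Exception, so the order of the catch blocks matters - if the generic Exception handler came first, it would swallow cancellations and show the notification anyway. A tiny standalone illustration (the function name is mine):

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Demonstrates that catch-block order decides who handles a CancellationException:
// the specific catch must come before the generic Exception catch.
fun classify(e: Exception): String = try {
    throw e
} catch (c: CancellationException) {
    "cancelled"   // normal during debouncing: stay silent
} catch (other: Exception) {
    "error"       // real failure: log and notify
}

fun main() {
    println(classify(CancellationException("debounce")))
    println(classify(RuntimeException("boom")))
}
```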
I added a test to verify this behavior:
fun testCancellationDoesNotThrowError() {
    // Configure server with a slow response to ensure requests overlap.
    server.respondWithDelay(200, 200, """{"content": "slow completion"}""")

    val service = Service()
    val request = makeInlineCompletionRequest("text-file.txt", 0, 0)

    // Fire multiple rapid requests - later ones should cancel earlier ones.
    // This should NOT throw any exceptions or show error notifications.
    runBlocking {
        val job1 = launch { service.getSuggestion(request) }
        delay(10)
        val job2 = launch { service.getSuggestion(request) }
        delay(10)
        val job3 = launch { service.getSuggestion(request) }

        // Wait for all jobs to complete
        job1.join()
        job2.join()
        job3.join()
    }

    // If we get here without exceptions, cancellation was handled gracefully.
    assertTrue(true)
}
The test fires three rapid requests with a slow server response. The second and third requests should trigger cancellation of earlier ones. If the CancellationException handler works correctly, the test passes without any exceptions.
Much better. The plugin now handles the complete debouncing flow without annoying the user with false error messages.
What's next
The plugin now has basic efficiency features: it skips requests when editing in the middle of lines, and it debounces to avoid firing too many requests while typing. But there's still room for improvement.
llama.vim also has a caching system that avoids re-requesting completions for contexts it has already seen. I'll tackle reproducing that in my next post.