LlamaIndex Streaming Cheat Sheet
Quick Reference Guide for Streaming LLM Responses in LlamaIndex
In this brief guide, I present several examples of how to use streaming responses in LlamaIndex.
This is largely motivated by my mild frustration in searching for official guides on using streaming responses. While I would typically jot down my findings in Notion, I believe sharing this publicly could benefit others in the future.
Sync Response Streaming
In this example, we use a QueryEngine instance to get the response in streaming format. If you are using a ChatEngine, an Agent, etc., the approach remains the same. First, enable streaming when creating the query engine:
```python
query_engine = index.as_query_engine(streaming=True, similarity_top_k=1)
```
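For context, here is a minimal sketch of how the index used above might be built, assuming your documents live in a ./data directory (with llama-index 0.9.x; in newer releases the imports move to llama_index.core):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load documents from a local folder (./data is an assumed path for this sketch)
documents = SimpleDirectoryReader("./data").load_data()

# Build an in-memory vector index over the documents
index = VectorStoreIndex.from_documents(documents)
```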
Once configured, you can print your response as follows:
```python
streaming_response = query_engine.query("Is 2024 a leap year?")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
```
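If you only need to print the tokens, the StreamingResponse object also exposes a print_response_stream() helper that runs the same loop for you:

```python
streaming_response = query_engine.query("Is 2024 a leap year?")
# Prints each token to stdout as it arrives
streaming_response.print_response_stream()
```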
Async Response Streaming
Similarly, asynchronous response streaming can be implemented as follows. Note that astream_chat is a ChatEngine method, so this example uses a chat engine rather than a query engine:
```python
chat_engine = index.as_chat_engine()

streaming_response = await chat_engine.astream_chat("Is 2024 a leap year?")
async for text in streaming_response.async_response_gen():
    print(text, end="", flush=True)
```
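The await/async for syntax above works directly in a notebook or inside a coroutine; in a plain script, you would wrap it in asyncio.run. A minimal sketch, assuming chat_engine is already defined as above:

```python
import asyncio

async def main() -> None:
    # Stream the chat response token by token
    streaming_response = await chat_engine.astream_chat("Is 2024 a leap year?")
    async for text in streaming_response.async_response_gen():
        print(text, end="", flush=True)

asyncio.run(main())
```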
That's it! I hope you found this information useful. I plan to submit a pull request to update the documentation with more examples on streaming. Wishing you all a Happy New Year!