LlamaIndex Streaming Cheat Sheet
Quick Reference Guide for Streaming LLM Responses in LlamaIndex
In this brief guide, I present several examples of how to use streaming responses in LlamaIndex.
This is largely motivated by my mild frustration in searching for official guides on using streaming responses. While I would typically jot down my findings in Notion, I believe sharing this publicly could benefit others in the future.
Sync Response Streaming
In this example, we use a QueryEngine instance to get the response in streaming format. If you are using a ChatEngine, an Agent, etc., the approach remains the same. First, enable streaming when creating the query engine:
```python
query_engine = index.as_query_engine(streaming=True, similarity_top_k=1)
```
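For context, here is a minimal sketch of how the index used above might be built, assuming your documents live in a ./data directory (with llama-index 0.9.x; in newer releases the imports move to llama_index.core):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load documents from a local folder (./data is an assumed path for this sketch)
documents = SimpleDirectoryReader("./data").load_data()

# Build an in-memory vector index over the documents
index = VectorStoreIndex.from_documents(documents)
```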
Once configured, you can print your response as follows:
```python
streaming_response = query_engine.query("Is 2024 a leap year?")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
```
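If you only need to print the tokens, the StreamingResponse object also exposes a print_response_stream() helper that runs the same loop for you:

```python
streaming_response = query_engine.query("Is 2024 a leap year?")
# Prints each token to stdout as it arrives
streaming_response.print_response_stream()
```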
Async Response Streaming
Similarly, asynchronous response streaming can be implemented as follows. Note that astream_chat is a ChatEngine method, so this example uses a chat engine rather than a query engine:
```python
chat_engine = index.as_chat_engine()

streaming_response = await chat_engine.astream_chat("Is 2024 a leap year?")
async for text in streaming_response.async_response_gen():
    print(text, end="", flush=True)
```
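The await/async for syntax above works directly in a notebook or inside a coroutine; in a plain script, you would wrap it in asyncio.run. A minimal sketch, assuming chat_engine is already defined as above:

```python
import asyncio

async def main() -> None:
    # Stream the chat response token by token
    streaming_response = await chat_engine.astream_chat("Is 2024 a leap year?")
    async for text in streaming_response.async_response_gen():
        print(text, end="", flush=True)

asyncio.run(main())
```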
That's it! I hope you found this information useful. I plan to submit a pull request to update the documentation with more examples on streaming. Wishing you all a Happy New Year!