If you only drive occasionally, you might not be very picky about the car you use. But if you are a professional driver who spends every day on the road, you will certainly become a demanding user, expecting the car to be highly performant, comfortable, and reliable. Similarly, car manufacturers facing discerning professional drivers must spare no effort to perfect their vehicles, as any minor flaw could lead to customer loss.
Large language models (LLMs) possess powerful reasoning capabilities and the ability to generalize from limited information, fundamentally changing the way we process information and solve problems. However, the market for general-purpose LLMs is already dominated by a handful of giants. For LLM system developers, the real value lies in solving specific tasks in vertical domains more effectively and efficiently.
Nevertheless, this is no easy task. The power of large models, combined with their unpredictability, makes them like a high-performance but hard-to-control car. Developers need to iterate and evaluate continually, maximizing the model's intelligence while ensuring the stability and predictability of its output and keeping variability within an acceptable range. This is akin to creating a car that is both high-performing and easy to handle. Only by doing so can they stand out in fierce market competition and become the preferred choice of users.
Understanding LLM-based System Evaluation
When it comes to evaluation, many people first think of evaluating the LLM itself. However, evaluating an LLM and evaluating a system built on an LLM are completely different concepts. Evaluating an LLM focuses on the model’s performance and capabilities, like measuring a person’s IQ. On the other hand, evaluating a system or application built on an LLM emphasizes its performance in solving real-world problems in specific scenarios, much like assessing a professional's job performance in a particular field.
For LLM-based system or application developers, it’s usually unnecessary to spend too much effort on evaluating the general capabilities of the model. Instead, they can start with the most capable models and focus on how to leverage the model’s abilities to build workflows that solve tasks in vertical domains.
LLM systems are not merely simple Q&A systems but consist of multiple intertwined decision-making and execution steps. Previously, decision-making often relied on classification models trained through supervised learning. Now, large models can achieve this through tool-calling mechanisms. The execution steps can be handled by large models or through a combination of LLMs and external knowledge, known as Retrieval-Augmented Generation (RAG), or by external RPA workflows, with the LLM summarizing the results.
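To make this concrete, here is a minimal sketch of such a workflow: a decision step that routes the request via the model, followed by execution steps that use RAG or an external workflow. The helpers `call_llm`, `search_knowledge_base`, and `run_order_status_workflow` are hypothetical stand-ins for your model API and infrastructure, not any particular library.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's API."""
    raise NotImplementedError

def search_knowledge_base(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical retrieval step (vector search, keyword search, etc.)."""
    raise NotImplementedError

def run_order_status_workflow(query: str) -> str:
    """Hypothetical external workflow (e.g., an RPA process)."""
    raise NotImplementedError

def answer(user_query: str) -> str:
    # Decision step: let the model route the request instead of a supervised classifier.
    route = call_llm(
        "Classify the request as 'faq', 'order_status', or 'other'. "
        f"Reply with the label only.\nRequest: {user_query}"
    ).strip()

    if route == "faq":
        # Execution step: Retrieval-Augmented Generation (RAG).
        contexts = search_knowledge_base(user_query)
        joined = "\n".join(contexts)
        return call_llm(
            f"Answer using only the context below.\nContext:\n{joined}\n\nQuestion: {user_query}"
        )
    if route == "order_status":
        # Execution step: call an external workflow, then have the LLM summarize its result.
        result = run_order_status_workflow(user_query)
        return call_llm(f"Summarize this result for the user: {result}")
    return call_llm(user_query)
```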
In this process, continuous optimization of prompts, models, and interaction flows is necessary. Complex systems may use different models at different stages and may even involve collaboration among multiple models. The greater challenge lies in building a system that combines intelligence with reliability.
This process is not as simple as stacking building blocks; it is a challenging piece of systems engineering. Constructing a well-performing LLM-based system typically goes through four stages:
- Prototype Foundation: Conceptualize, rapidly build a prototype, and perform initial validation.
- End-to-End Evaluation: Set goals, determine evaluation criteria, prepare evaluation samples, run end-to-end evaluations, and analyze results.
- Component-wise Evaluation and Optimization: Identify issues by evaluating the system's components and intermediate steps, and optimize the system accordingly.
- Real-world Validation: Deploy the system online once it meets the expected standards of quality and performance. This is not the end of optimization but a new beginning. Continuous monitoring, data collection, and repeated optimization are necessary after deployment.
Next, we will elaborate on the ideas and methods of validation and evaluation for these four stages respectively.
Step 1: Quickly Build Prototypes, Iterate and Validate
When you have any creative ideas, rapidly building a prototype is a crucial step in bringing those ideas to fruition.
As mentioned earlier, LLM systems are composed of a series of intertwined decision-making and execution steps. Each step that uses an LLM requires selecting an appropriate base model and crafting customized prompts, and may also involve knowledge retrieval or tool calls. The challenge is how to chain these steps together into workflows that execute automatically around the target task.
There are many development tools that can help us quickly build prototypes, such as development frameworks like LangChain, or choosing OpenAI's Assistants API, as well as visual workflow orchestration platforms like Coze, Dify, Flowise and Langflow. These allow us to ignore some underlying technical details and focus more energy on business processes.
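For example, a first prototype of a single step can be wired up in a few lines with LangChain's expression language. This is only a sketch: the package layout, model name, and prompt are assumptions that will vary with your LangChain version and provider.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Prompt -> model -> parser, composed with LangChain's expression language (LCEL).
prompt = ChatPromptTemplate.from_template(
    "You are a support assistant for an appliance store.\n"
    "Answer the customer's question concisely.\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # swap in whichever chat model you use
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"question": "How do I reset my espresso machine?"}))
```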
In the process of building system prototypes, we already need to conduct preliminary validation of the system. The goal is to ensure that the workflow can run normally and has a certain fault tolerance for unexpected situations, achieving baseline generation stability.
In traditional machine learning training, to obtain a usable system, we need to do a lot of data preparation. This includes steps such as data collection, cleaning, and labeling. However, because LLMs have already been pre-trained on massive diverse data, they possess powerful basic intelligence and excellent generalization abilities. This makes it possible for us to build prototypes through rapid iteration.
When the output effect of the initial system is not satisfactory, we can analyze the reasons for the problems, summarize common patterns, and optimize the prompts through some prompt engineering techniques (such as reflection techniques, chain-of-thought, few-shot prompting, etc.). We can then compare the new output results with the original ones, without needing to retrain like in traditional machine learning every time.
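As an illustration, a prompt revision might simply add a response format and a couple of few-shot examples before re-running the same test cases; the task and examples below are placeholders.

```python
# Baseline prompt versus a few-shot revision for the same (illustrative) extraction task.
BASELINE_PROMPT = "Extract the product name and issue from this support ticket:\n{ticket}"

FEW_SHOT_PROMPT = """Extract the product name and issue from the support ticket.
Respond as JSON with keys "product" and "issue".

Ticket: The X200 router keeps dropping my Wi-Fi connection.
Answer: {{"product": "X200 router", "issue": "intermittent Wi-Fi disconnects"}}

Ticket: My AquaPure filter leaks at the base after a week of use.
Answer: {{"product": "AquaPure filter", "issue": "leak at the base"}}

Ticket: {ticket}
Answer:"""

# Run both prompts over the same handful of test tickets and compare the outputs
# side by side before deciding whether to keep the change.
```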
At this stage, the requirement for sample size is not high. We can use handcrafted test cases, try them in the Playgrounds provided by LLM providers and orchestration tools, or use a one-stop workbench such as ConsoleX.ai, and rely on human expert judgment for evaluation.
However, evaluation in Playgrounds can only cover specific use cases and cannot give us enough confidence that the system will generate stable responses. At this point, we should move on to the next stage and conduct more systematic evaluations based on larger-scale data.
Unfortunately, many developers pour time and energy into building the system but bring the product to market without thorough evaluation. This is like driving a prototype car that hasn't passed safety inspections onto the road: the consequences can be serious, even disastrous. For example, Air Canada lost a lawsuit and had to refund and compensate a customer after its customer service chatbot gave out misleading information about a promotional policy. In professional fields such as healthcare, law, and psychological counseling, using LLM-based systems to provide services obviously requires even more caution, as any mistake could cost the company dearly.
Step 2: Preparing for End-to-End Evaluation
So, how can we more systematically optimize LLM systems through evaluation?
First, you should be mentally prepared that after entering this new stage, you'll need to adopt some new methods, use professional evaluation tools and frameworks, and invest some time and effort in preparation work. This process won't be accomplished overnight, and repeated iterations may be needed before all indicators reach the benchmark targets.
However, if you want your LLM-based system to stand out from fierce competition or truly be competent for work tasks in vertical domains, rather than being busy dealing with various unexpected situations, such investment is necessary and worthwhile.
Specifically, before entering the end-to-end evaluation stage, you need to make the following three types of preparations:
- Establish evaluation metrics and benchmarks: Based on business goals and experience from the previous stage, determine the dimensions to be assessed and quantify the target for each dimension as a benchmark.
- Determine evaluation methods and establish evaluators: Choose between manual and automatic evaluation. For automatic evaluation, you also need to build evaluators around the metrics and quantitative standards to automate the evaluation process.
- Prepare the datasets needed to run evaluations: Spend time preparing domain-relevant evaluation datasets.
Determine Evaluation Metrics and Benchmarks
First, we need to determine the evaluation metrics and the final benchmarks to be achieved.
In traditional NLP tasks, commonly used evaluation metrics include accuracy, precision, recall, and F1 score. For tasks based on generative AI, determining evaluation metrics is usually more complex and diverse.
The choice of evaluation metrics also depends on your actual application scenario. For example, for medical applications, factual accuracy might be key, while for casual chatbots, contextual coherence matters more.
In practical application scenarios, most LLM systems need to be evaluated on multiple metrics, not just a single metric. For example, customer service chatbots need to be evaluated on dimensions such as answer accuracy, coherence, knowledge citation correctness, response time, and user satisfaction.
After determining the metrics to be evaluated, we need to set corresponding benchmarks. Setting benchmarks helps clarify optimization directions and provides references for system improvement.
Good benchmarks should be both measurable and achievable:
- Measurability: Use quantitative indicators or clear qualitative scales, rather than vague expressions.
- Achievability: Set targets based on industry benchmarks, prior experiments, published research, or expert knowledge, avoiding benchmark targets that are detached from reality (see the sketch below for one way to record such benchmarks).
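As a small illustration, metrics and their benchmark targets can be recorded in a single shared structure that both the evaluators and the reports read. The metric names, scales, and thresholds below are purely illustrative.

```python
# Illustrative benchmark registry shared by evaluators and reporting code.
BENCHMARKS = {
    "answer_accuracy":      {"scale": "0-1", "target": 0.90},  # judged against reference answers
    "coherence":            {"scale": "1-5", "target": 4.0},   # LLM-as-judge rubric score
    "citation_correctness": {"scale": "0-1", "target": 0.95},  # cited sources actually support the claim
    "p95_response_time_s":  {"scale": "sec", "target": 3.0},   # latency budget, lower is better
}

def meets_benchmark(metric: str, value: float) -> bool:
    target = BENCHMARKS[metric]["target"]
    # Latency is "lower is better"; the quality metrics are "higher is better".
    return value <= target if metric == "p95_response_time_s" else value >= target
```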
Manual Eval vs. Auto Eval
Next, we need to determine the method of evaluation based on the metrics. We can choose between manual evaluation or automatic evaluation.
While manual eval can provide detailed feedback and a deeper understanding of complex issues, it is time-consuming and susceptible to subjective bias. In contrast, auto eval can process large amounts of data quickly and consistently, providing objective results, but it requires more upfront time and effort to prepare.
This matters especially because evaluation needs to run over large-scale data, and as LLM capabilities continue to evolve, the same evaluation may need to be repeated. Automating the evaluation process as much as possible therefore becomes very important.
If auto eval is chosen, the next task is to establish evaluators around metrics and benchmarks. An evaluator is an automated program that performs the evaluation process according to one or more metrics and judges whether the eval results meet the benchmarks.
Based on how evaluators complete their task, they can be divided into algorithm-based, specialized-model-based, and LLM-based evaluators. For scenarios with clear rules, we can adopt algorithm-based evaluators, for example checking how closely texts match, computing similarity with specific algorithms, or validating the consistency of JSON data or schemas. For classification tasks, we can use specialized models trained with traditional ML methods, such as a pre-trained sentiment analysis model that scores the sentiment polarity (positive, negative, neutral) of user feedback. LLM-based evaluators can handle more complex and diverse generation scenarios, helping ensure that the quality and user experience of LLM systems meet the expected benchmarks.
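Here is a minimal sketch of two algorithm-based evaluators, an exact-match check and a JSON-structure check; the function names are illustrative rather than any specific framework's API.

```python
import json

def exact_match(output: str, reference: str) -> bool:
    """Rule-based check: does the normalized output equal the reference?"""
    return output.strip().lower() == reference.strip().lower()

def valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """Structural check: is the output parseable JSON containing the expected keys?"""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())

print(exact_match("Paris", " paris "))                                    # True
print(valid_json_with_keys('{"product": "X200"}', {"product", "issue"}))  # False: "issue" missing
```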
Around the same evaluation metric, multiple different types of evaluators can be used for evaluation. For example, for text similarity, both embedding distance algorithms and LLM can be used for evaluation.
LLM as a Judge
Why can LLMs excel in the role of evaluators?
First, there may be a gap in intelligence between the models used for generation and those used for evaluation. Due to practical factors such as cost, privacy, device limitations, or latency, we don't need to (or cannot) use the most capable models in every scenario. However, to pursue accurate evaluation results, we usually want to apply the most capable models to evaluation.
Moreover, generation and evaluation have different focuses. In content creation scenarios, for example, generative models need high creativity and flexibility, capable of producing diverse and imaginative content, like writers exploring possibilities during creation. Evaluation, on the other hand, requires models to strictly review and score text against specific standards, like editors assessing manuscript quality based on established criteria. Generative models focus on diversity and creativity, while evaluation models focus on accuracy and consistency.
Depending on whether ideal answers are provided, evaluators fall into two categories: with reference and without reference. Reference-based evaluators need reference answers prepared before evaluation begins, which can be one or more acceptable options, an ideal answer, or a problem-solving thought process. If no reference answer is provided, the prompt needs to guide the model to make judgments based on its own knowledge and reasoning ability.
Using LLMs as judges is essentially a generation task that relies on evaluation prompt templates. Therefore, using highly capable models and well-tested prompts is crucial for the accuracy of eval results. For more specialized scenarios, general evaluation prompts may not fully apply; in that case, it's necessary to customize evaluators that meet specific needs. EvalsOne provides a way to create custom evaluators through YAML configuration files.
Finally, we also need to be aware of the limitations of LLM-as-a-judge: since the approach is itself built on LLMs and evaluation prompts, it is likewise constrained by prompt quality and the base capabilities of the models used. Moreover, LLM judges still make mistakes with some probability, and the prompts used for evaluation themselves need to be evaluated and verified.
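A reference-based LLM-as-judge evaluator can be sketched roughly as follows. The rubric, prompt template, and `call_llm` wrapper are assumptions for illustration; a production judge should use a well-tested prompt and a highly capable model.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever capable model serves as the judge."""
    raise NotImplementedError

JUDGE_PROMPT = """You are a strict evaluator.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Score the candidate from 1 to 5 for factual agreement with the reference answer.
Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge(question: str, reference: str, candidate: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    return json.loads(raw)  # in practice, add retries and schema validation here
```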
However, this doesn't mean we should abandon LLMs due to these challenges. On the contrary, from the overall perspective of cost and benefit, LLMs as judges has significant advantages in many aspects. With continuous technological progress and improvement, these issues can be gradually resolved. By combining the advantages of manual and auto evals, we can establish a more efficient, accurate, and comprehensive evaluation system.
Preparing Eval Datasets
Next, we need to prepare larger evaluation datasets, also called Test Cases. Rather than preparing massive evaluation datasets at once, we can consider adopting a progressive approach. First, expand the sample set to a larger scale, run the evaluation, and if the results meet the standards, then expand to an even larger sample set, increasing the diversity of the data and introducing challenging edge cases. If the evaluation on the expanded sample set doesn't meet the standards, then we need to identify problems and optimize the system.
For a long time, preparing evaluation sample sets has been primarily done manually, which is both time-consuming and tedious. With the continuous advancement of large language model capabilities, using LLMs to expand evaluation datasets has become possible. More and more research and solutions are emerging in this field. For example, Claude's Workbench integrates the function of intelligently generating variables, which can automatically generate diverse test cases.
However, using large models to expand evaluation sample sets also faces some challenges. A major problem is the tendency for data homogenization, leading to a lack of diversity in test cases. Additionally, large language models sometimes produce hallucinations, generating incorrect results. Therefore, at the current stage, manual review of AI-generated sample data remains an indispensable part of the process. Manual review not only ensures the accuracy and diversity of data but also helps identify and correct errors that occur during model generation.
By combining human and AI power, we can more efficiently generate diverse test cases, ensuring that the system performs excellently in various situations.
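One simple pattern is to have an LLM paraphrase seed cases and then route the results through human review. The sketch below assumes a hypothetical `call_llm` wrapper around your provider's API.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's API."""
    raise NotImplementedError

def expand_test_case(seed_question: str, n: int = 5) -> list[str]:
    """Ask the model for n paraphrases of a seed question, one per line."""
    prompt = (
        f"Write {n} paraphrases of the question below, one per line, varying wording, "
        "tone, and level of detail while keeping the underlying intent identical.\n"
        f"Question: {seed_question}"
    )
    lines = call_llm(prompt).splitlines()
    # Strip list markers; generated cases should still go through manual review to
    # catch hallucinated or overly similar variants before entering the dataset.
    return [line.strip("-*0123456789. ").strip() for line in lines if line.strip()]
```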
Step 3: Component-wise Evaluation and Optimization
Once the metrics and benchmarks are determined, and automated evaluators are established, we can start running end-to-end evaluations to judge the overall quality and performance of the current system.
Next, collect result data and judge the gap between the evaluation results and the expected benchmarks. If the expected goals are achieved, we can expand the dataset and increase the challenge of the data. But if the expected goals are not met, we need to analyze the reasons from the process and optimize the system.
This is again a difficult process, one that often leaves people unsure where to start. Is it a problem with the model itself, with the prompts, with the RAG data retrieval process, or with the design of the Agent workflow?
At this point, we need to break down the whole system into specific sub-steps and components and put them under the microscope for evaluation. This requires us to have a further understanding of the working mechanism of the LLMs and the systems.
For example, a RAG pipeline can be divided into a retrieval process and a generation process. The chunking strategy, embedding model, retrieval method, and whether the LLM's generation faithfully uses the retrieved contexts all affect the final result. At this point, we need to deconstruct the process and trace exactly what happened behind each step, somewhat like debugging in software development. Some third-party tools provide such tracing functionality, such as LangChain and Langfuse, allowing us to better grasp the process details and adjust and optimize system parameters.
However, tracing works on single or small numbers of samples. For batch data, more specialized evaluation tools like EvalsOne and systematic evaluation methods are needed. In component-wise evaluation, the eval metrics, the components involved, and the benchmarks differ from those used in end-to-end evaluation. Taking RAG pipeline evaluation as an example again, Ragas proposes four metric dimensions for RAG evaluation: answer relevance, context precision, context recall, and faithfulness, providing a practical way to improve the overall effect of the RAG pipeline through evaluation.
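For reference, running these four Ragas metrics over a small batch looks roughly like the sketch below. It follows the Ragas 0.1.x interface (column names and APIs differ between versions) and assumes LLM and embedding credentials are already configured; the sample record is made up.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One evaluation record: the user question, the pipeline's answer, the retrieved
# contexts, and a reference answer (required by some of the metrics).
records = {
    "question": ["What is the warranty period for the X200 router?"],
    "answer": ["The X200 router comes with a two-year limited warranty."],
    "contexts": [["All X200 series routers include a 24-month limited warranty."]],
    "ground_truth": ["The X200 router has a two-year (24-month) limited warranty."],
}

result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores for the RAG pipeline on this batch
```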
Evaluation is also extremely important for AI Agents. A complex AI Agent needs to chain prompts, make decisions and use external tools, combine vertical-domain RPA workflows, and use the workflow results for further generation. Projects like AutoGPT have shown us the potential of Agents, but also exposed their limitations. Every decision made during Agent execution may lead to deviation from the goal, which requires the Agent to have the ability to self-reflect and correct errors, and also requires evaluating the effect of each execution step.
The evaluation of Agents is still a challenging and exploratory field. Some have proposed the concept of guardrails for Agent evaluation: evaluating the effect of intermediate steps in real time and correcting promptly (e.g., by regenerating). However, the guardrail approach can only keep the Agent's execution from drifting off track as much as possible; it cannot guarantee that the Agent will accomplish the intended goals well. Moreover, repeated error correction may add substantial extra latency and cost.
Once component-level analysis and evaluation help us find the causes and make targeted optimizations, we still need to re-run end-to-end evaluations to confirm that the LLM system has actually improved. We may need to repeat this cycle, from the whole to the components and back to the whole, multiple times until the expected benchmarks are finally reached.
Sometimes we may find that system optimization is limited by the prototyping tools we used. For example, visual orchestration tools like Coze, Dify, and Langflow lower the barrier to building Agents but also offer limited tuning capabilities. In practice, we can first use a visual orchestration tool to quickly build a prototype, evaluate the results, and attempt to tune it. But if the framework's limitations keep us from reaching the expected benchmark, we should consider rebuilding the system at a lower level of abstraction so that we have better control over the intermediate processes and parameters of the LLM-based system.
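A minimal sketch of the guardrail idea, under the assumption that each step exposes a run function and a scoring function (for example an LLM-as-judge check), might look like this; the threshold and retry count are illustrative.

```python
from typing import Callable

def run_step_with_guardrail(
    run_step: Callable[[dict], str],            # executes one agent step given the current state
    score_step: Callable[[dict, str], float],   # scores that step's output, e.g. via an LLM judge
    state: dict,
    max_retries: int = 2,
    threshold: float = 0.7,
) -> str:
    for _ in range(max_retries + 1):
        output = run_step(state)
        if score_step(state, output) >= threshold:
            return output
        # Below threshold: retry (regenerate) -- note each retry adds latency and cost.
    # The guardrail could not be satisfied; fail loudly rather than drift off-goal.
    raise RuntimeError("Step failed guardrail; consider human review or a fallback path")
```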
Step 4: Real-world Validation and Monitoring
"It's easy to make something cool with LLMs, but very hard to make something production-ready with them"
——Chip Huyen
The more thorough the previous evaluations, the more confidence we will have in the performance of the LLM-based system in the real world. However, due to the characteristics of the production environment, we should maintain stronger vigilance against errors, so evaluation and monitoring in the production environment are often inseparable.
When LLM systems provide services in a production environment, it's impossible to fully evaluate each output before presenting it to end users, as this would increase system latency and affect user experience. However, we can still collect real-world data through APIs for post-hoc evaluation, and strengthen alerting mechanisms to make agile responses and necessary handling for abnormal situations.
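One common pattern is to sample a fraction of production traffic and score it asynchronously, so the evaluator never sits on the request path. The sampling rate, alert threshold, and scoring function below are illustrative assumptions.

```python
import queue
import random
import threading

SAMPLE_RATE = 0.05                 # evaluate roughly 5% of production responses
eval_queue: "queue.Queue[dict]" = queue.Queue()

def log_for_evaluation(user_query: str, response: str) -> None:
    """Called on the request path: cheap sampling only, no evaluation here."""
    if random.random() < SAMPLE_RATE:
        eval_queue.put({"query": user_query, "response": response})

def evaluation_worker(score_fn) -> None:
    """Background worker: scores sampled records and raises alerts on low scores."""
    while True:
        record = eval_queue.get()
        record["score"] = score_fn(record["query"], record["response"])
        if record["score"] < 0.6:  # illustrative alert threshold
            print("ALERT: low-quality response detected", record)
        eval_queue.task_done()

# Example: start the worker with your own scoring function.
# threading.Thread(target=evaluation_worker, args=(my_score_fn,), daemon=True).start()
```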
At the same time, it's necessary to periodically review and check output quality. Real-world situations can always offer inspiration for improving the system. From development to testing and then to production, it's a process of continuously generalizing sample data and pursuing excellence through continuous integration and delivery (CI/CD), making the system more robust.
However, this doesn't mean our goal is to make the system achieve 99.999% accuracy. The intelligence and generalization ability of generative AI inevitably comes with unpredictability. From a business perspective, what we need to do is reduce the probability of system errors to below the tolerance limit, while achieving cost reduction and efficiency improvement, thereby maximizing business benefits.
Finally, let's talk about the tools needed for LLM-based system evaluation. If you're a tech expert, there are many open-source evaluation frameworks to choose from, such as OpenAI's Evals, Promptfoo, and Ragas; LangChain and LlamaIndex also include evaluation components. Their advantages are that they are open-source, transparent, and free, and you don't have to worry too much about data privacy issues.
However, open-source evaluation tools often provide solutions at a more technical level, requiring users to have a certain development background, and they are harder for non-developer roles on the team to get started with. Moreover, they provide only evaluation tools rather than complete solutions. If you want to automate the evaluation process, including data collection, preparation, iterative evaluation management, and monitoring in production, secondary development is required.
Many LLM providers and hosting platforms also provide evaluation functions, but they often only support their own models and have limited functionality. By comparison, SaaS platforms like EvalsOne offer friendlier interfaces and workflows, broader model support, a lower threshold for introducing evaluation into a team's process, and a complete set of evaluation solutions spanning development to production. This allows developers to focus more energy on system optimization, rather than reinventing the wheel on evaluation infrastructure or wrangling sample data all day.
Of course, what is a wise decision depends on the specific situation of each team and scenario. For example, in scenarios where data privacy is extremely important, adopting an open-source solution may still be the preferred choice. For business-priority scenarios, the intuitive interface and comprehensive solutions of SaaS systems can allow team members of various roles to smoothly participate in the improvement process of LLM-based systems.
Takeaway
Finally, let's summarize the key points of this article:
- LLM-based system evaluation is different from LLM evaluation, focusing more on the specific performance of the system in solving practical problems in vertical scenarios, which is crucial for building reliable, high-performance AI applications.
- Building an outstanding LLM system through evaluation usually goes through four stages: prototype building, end-to-end evaluation, component-wise evaluation and optimization, and real-world validation. Testing and evaluation should run through the entire process.
- From prototype building to end-to-end evaluation, some preparations are needed first: establishing evaluation metrics and benchmarks, determining evaluation methods and creating evaluators, and preparing eval datasets.
- Using an LLM as a judge is an effective method. Introducing auto eval can improve efficiency and effectiveness compared to manual eval, but it's also necessary to accept the uncertainty that comes with it.
- Optimizing LLM-based systems through iterative evaluation is a cyclical process from whole to components and back to whole, needing to be repeated until the benchmark is reached.
- Evaluation in the production environment is inseparable from monitoring, requiring the establishment of alerting mechanisms and continuous improvement driven by results.
- When choosing evaluation solutions, it's necessary to weigh the pros and cons of open-source tools and SaaS platforms, and make decisions based on team situations and application scenarios.
Linking the previous steps together gives an overall roadmap for creating excellent LLM-based systems through iterative evaluation.
Although there are already many popular tools and solutions for LLM-based system evaluation, due to the rapid development of generative AI and the complexity of the evaluation process itself, teams that truly incorporate evaluation smoothly into their workflows still only account for a small proportion. Our goal in developing EvalsOne is to make the evaluation process simpler, more intuitive, easy to get started with, while being comprehensive in functionality and cost-controllable.
Later, we will deliver follow-up articles on preparing datasets, establishing metrics and evaluators, and the methods for optimizing various LLM systems through evaluations, enabling teams to fully master evaluation tactics to create more LLM systems with excellent quality and performance.