Optimizing generative AI by backpropagating language model feedback

  • Article
  • Published:

Nature volume 639, pages 609–616 (2025)


Subjects

  • Computational science
  • Computer science

Abstract

Recent breakthroughs in artificial intelligence (AI) are increasingly driven by systems orchestrating multiple large language models (LLMs) and other specialized tools, such as search engines and simulators. So far, these systems are primarily handcrafted by domain experts and tweaked through heuristics rather than being automatically optimized, presenting a substantial challenge to accelerating progress. The development of artificial neural networks faced a similar challenge until backpropagation and automatic differentiation transformed the field by making optimization turnkey. Analogously, here we introduce TextGrad, a versatile framework that performs optimization by backpropagating LLM-generated feedback to improve AI systems. By leveraging natural language feedback to critique and suggest improvements to any part of a system—from prompts to outputs such as molecules or treatment plans—TextGrad enables the automatic optimization of generative AI systems across diverse tasks. We demonstrate TextGrad’s generality and effectiveness through studies in solving PhD-level science problems, optimizing plans for radiotherapy treatments, designing molecules with specific properties, coding, and optimizing agentic systems. TextGrad empowers scientists and engineers to easily develop impactful generative AI systems.
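
The abstract gives only a high-level picture of the workflow, so the following is a minimal sketch of the PyTorch-style pattern documented in the public TextGrad repository, applied to test-time optimization of a single answer. It assumes the textgrad Python package and an OpenAI API key are available; the class and function names (Variable, BlackboxLLM, TextLoss, TGD) follow the repository's quickstart and may change between versions.

```python
import textgrad as tg

# Assumption: an OpenAI API key is configured in the environment.
# gpt-4o supplies the natural-language "gradients" in the backward pass.
tg.set_backward_engine("gpt-4o", override=True)

question_text = ("If it takes 1 hour to dry 25 shirts under the sun, "
                 "how long does it take to dry 30 shirts? Reason step by step.")

# Forward pass: a blackbox LLM produces an initial solution.
model = tg.BlackboxLLM("gpt-4o")
question = tg.Variable(question_text,
                       role_description="question to the LLM",
                       requires_grad=False)
answer = model(question)
answer.set_role_description("concise and accurate answer to the question")

# Textual loss: an LLM critique of the answer plays the role of the objective.
loss_fn = tg.TextLoss(f"Here is a question: {question_text} "
                      "Evaluate any given answer to this question; be logical, "
                      "very critical, and give concise feedback.")
optimizer = tg.TGD(parameters=[answer])

# Backward pass propagates the critique to the answer; step() rewrites it.
loss = loss_fn(answer)
loss.backward()
optimizer.step()
print(answer.value)
```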

Data availability

We used publicly available data to evaluate TextGrad. Details on how to access the data are available at https://github.com/zou-group/textgrad.

Code availability

The TextGrad code and experiments are available at https://github.com/zou-group/textgrad and https://doi.org/10.5281/zenodo.14497017 (ref. 46).

References

  1. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

  2. Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving olympiad geometry without human demonstrations. Nature 625, 476–482 (2024).

  3. Li, Y. et al. Competition-level code generation with AlphaCode. Science 378, 1092–1097 (2022).

  4. Yang, J. et al. SWE-agent: agent–computer interfaces enable automated software engineering. In Adv. Neural Inf. Process. Syst. 37 (NeurIPS, 2024).

  5. Khattab, O. et al. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations (2024).

  6. Zaharia, M. et al. The shift from models to compound AI systems. BAIR https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/ (2024).

  7. Zhou, Y. et al. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations (2023).

  8. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Adv. Neural Inf. Process. Syst. 25 (NeurIPS, 2012).

  9. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  10. Fawzi, A. et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610, 47–53 (2022).

  11. Mankowitz, D. J. et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature 618, 257–263 (2023).

  12. Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85 (2023).

  13. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

  14. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).

  15. Pryzant, R. et al. Automatic prompt optimization with “gradient descent” and beam search. In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 7957–7968 (Association for Computational Linguistics, 2023).

  16. Zheng, L. et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst. 36, 46595–46623 (2023).

  17. Li, X. et al. AlpacaEval: an automatic evaluator of instruction-following models. GitHub https://github.com/tatsu-lab/alpaca_eval (2023).

  18. Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at https://arxiv.org/abs/2204.05862 (2022).

  19. Madaan, A. et al. Self-refine: iterative refinement with self-feedback. In Adv. Neural Inf. Process. Syst. 36 (NeurIPS, 2023).

  20. Stiennon, N. et al. Learning to summarize with human feedback. Adv. Neural Inf. Process. Syst. 33, 3008–3021 (2020).

  21. Yuan, W. et al. Self-rewarding language models. In Forty-first International Conference on Machine Learning (2024).

  22. Dubois, Y. et al. AlpacaFarm: a simulation framework for methods that learn from human feedback. In Adv. Neural Inf. Process. Syst. 36 (NeurIPS, 2023).

  23. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. & Yao, S. Reflexion: language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 36, 8634–8652 (2023).

  24. Rein, D. et al. GPQA: a graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling (2024).

  25. Hendrycks, D. et al. Measuring massive multitask language understanding. In The Ninth International Conference on Learning Representations (2021).

  26. Lu, P. et al. MathVista: evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations (2024).

  27. Lu, P. et al. Learn to explain: multimodal reasoning via thought chains for science question answering. Adv. Neural Inf. Process. Syst. 35, 2507–2521 (2022).

  28. Liu, P. et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35 (2023).

  29. Suzgun, M. et al. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023 13003–13051 (Association for Computational Linguistics, 2023).

  30. Cobbe, K. et al. Training verifiers to solve math word problems. Preprint at https://arxiv.org/abs/2110.14168 (2021).

  31. Yang, C. et al. Large language models as optimizers. In The Twelfth International Conference on Learning Representations (2024).

  32. Dubey, A. et al. The Llama 3 herd of models. Preprint at https://arxiv.org/abs/2407.21783 (2024).

  33. Yang, A. et al. Qwen2 technical report. Preprint at https://arxiv.org/abs/2407.10671 (2024).

  34. Khan, F. M., Gibbons, J. P. & Sperduto, P. W. Khan’s Treatment Planning in Radiation Oncology (Lippincott Williams & Wilkins (Wolters Kluwer), 2016).

  35. Hussein, M., Heijmen, B. J. M., Verellen, D. & Nisbet, A. Automation in intensity modulated radiotherapy treatment planning—a review of recent innovations. Br. J. Radiol. 91, 20180270 (2018).

  36. Kisling, K. et al. Radiation planning assistant-a streamlined, fully automated radiotherapy treatment planning system. J. Vis. Exp. 134, e57411 (2018).

  37. Huang, C., Nomura, Y., Yang, Y. & Xing, L. Meta-optimization for fully automated radiation therapy treatment planning. Phys. Med. Biol. 67, 055011 (2022).

  38. Yang, Y. & Xing, L. Clinical knowledge-based inverse treatment planning. Phys. Med. Biol. 49, 5101 (2004).

  39. Liu, S. et al. Automated radiotherapy treatment planning guided by GPT-4Vision. Preprint at https://arxiv.org/abs/2406.15609 (2024).

  40. Lu, P. et al. Chameleon: plug-and-play compositional reasoning with large language models. Adv. Neural Inf. Process. Syst. 36, 43447–43478 (2023).

  41. Yan, B., Zhang, J., Yuan, Z., Shan, S. & Chen, X. Evaluating the quality of hallucination benchmarks for large vision-language models. Preprint at https://arxiv.org/abs/2406.17115 (2024).

  42. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).

  43. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proc. COMPSTAT’2010 (eds Lechevallier, Y. & Saporta, G.) 177–186 (Physica-Verlag, 2010).

  44. Wang, Q. et al. High-dimensional automated radiation therapy treatment planning via Bayesian optimization. Med. Phys. 50, 3773–3787 (2023).

  45. Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: a next-generation hyperparameter optimization framework. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019).

  46. Bianchi, F. et al. zou-group/textgrad: v0.1.6. Zenodo https://doi.org/10.5281/zenodo.14497017 (2024).

Acknowledgements

We thank D. Yilmaz, F. Dinc, B. Ergun, Y. Sun, I. Covert, K. Swanson, O. Faruk Akgun, Y. Efe, O. Khattab, K. Y. Wu, E. Wu, K. Vodrahalli, O. Pastor Serrano, P. J. Chia, J. Tagliabue, N. Thakkar, E. Simon, S. Eyuboglu, I. Gao, L. Chen and members of the Zou group and the Guestrin group for their support and comments on this work. This work was supported by funding from the Chan Zuckerberg Biohub. C.G. was supported by funding from the Chan Zuckerberg Biohub, Stanford HAI, AFOSR grant FA9550-21-1-0397, and gifts from Google and IBM.

Author information

Author notes

  1. These authors contributed equally: Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang

Authors and Affiliations

  1. Department of Computer Science, Stanford University, Stanford, CA, USA

    Mert Yuksekgonul, Federico Bianchi, Carlos Guestrin & James Zou

  2. Department of Biomedical Data Science, Stanford University, Stanford, CA, USA

    Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang & James Zou

  3. Chan Zuckerberg Biohub, San Francisco, CA, USA

    Carlos Guestrin & James Zou

Authors

  1. Mert Yuksekgonul
  2. Federico Bianchi
  3. Joseph Boen
  4. Sheng Liu
  5. Pan Lu
  6. Zhi Huang
  7. Carlos Guestrin
  8. James Zou

Contributions

M.Y. and J.Z. conceptualized the research and led the overall project. M.Y. developed the primary codebase and led prompt optimization and solution optimization. M.Y., F.B. and J.B. designed the abstractions. F.B. led code optimization. J.B. led molecule optimization. S.L. led treatment planning optimization and compound system optimization. P.L. led solution optimization in multimodal settings and compound system optimization. Z.H. and C.G. advised the project. J.Z. supervised the project. All authors contributed to the preparation of the paper and approved the final version.

Corresponding authors

Correspondence to Mert Yuksekgonul or James Zou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Kai-Wei Chang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Molecule optimization via text.

TextGrad optimizes a starting benzene fragment to improve its druglikeness (higher QED) and binding affinity (lower Vina score) to the protein receptor PPARA. The textual gradients for the first three iterations are shown in (a), and the performance of all ten iterations compared to clinically approved molecules targeting PPARA is shown in (c). The molecule at the final iteration has low structural similarity with its most similar clinically approved counterpart and better QED and Vina scores (d), with a highly plausible pose geometry shown in (e). Across 29 targets and three initial fragments, TextGrad successfully designs molecules with similar Vina scores and greater QED scores than clinically approved molecules (b).
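
As context for the two objectives named in this caption, druglikeness is commonly quantified by the QED score, which can be computed directly from a molecule's SMILES string, whereas the Vina binding-affinity score requires docking against a prepared receptor structure. The snippet below is an illustrative sketch of the QED part only, assuming RDKit is installed; it is not the paper's exact scoring pipeline.

```python
# Illustrative QED computation with RDKit (an assumption, not the paper's code).
from rdkit import Chem
from rdkit.Chem import QED

benzene = Chem.MolFromSmiles("c1ccccc1")          # the starting benzene fragment
print(f"QED of benzene: {QED.qed(benzene):.3f}")  # higher QED = more druglike

# The binding-affinity objective (AutoDock Vina score against PPARA; lower is
# better) requires a prepared receptor and a docking run, so it is not shown.
```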

Extended Data Fig. 2 Comparing TextGrad to other molecule optimization methods.

(a) Scaled mean top-5 AUCs for each method across all 58 protein targets. (b) Sample trajectories. For each algorithm, we selected the best performing trajectory to visualize, as measured by scaled Top-5 AUC, for the protein target listed, with the shaded error bars representing the standard error across the three repetitions (seeds/initial molecules). The blue star indicates the iteration at which TextGrad’s early stopping condition was triggered.

Extended Data Fig. 3 Code optimization details.

(a) We show the test-time objective we use for code optimization. (b) We report the results for LeetCode Hard using gpt-4o. We report the standard deviation over five random seeds in brackets.
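
The exact wording of the test-time objective is given in panel (a); the sketch below only illustrates how such instance-level code refinement can be set up with the public TextGrad package. The objective text, the toy problem, the seed solution and the iteration count are placeholders, not the settings used in the paper.

```python
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)  # gradient (feedback) engine

problem = ("Given an integer array nums, return the length of the longest "
           "strictly increasing subsequence.")
# In practice the first draft would itself come from an LLM; this stub is a stand-in.
draft = "def lengthOfLIS(nums):\n    return len(nums)"

# The candidate program is the variable optimized at test time.
code = tg.Variable(draft, requires_grad=True,
                   role_description="Python solution to the coding problem")

# Placeholder test-time objective; the paper's wording appears in panel (a).
loss_fn = tg.TextLoss(f"Problem: {problem}\n"
                      "Critique the given code: check correctness, reason "
                      "about edge cases and complexity, and suggest fixes.")
optimizer = tg.TGD(parameters=[code])

for _ in range(3):               # a few refinement iterations (illustrative)
    loss = loss_fn(code)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(code.value)
```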

Extended Data Fig. 4 Solution optimization details.

(a) Solution optimization for zero-shot question answering with gpt-4o. TextGrad outperforms the baselines consistently across different tasks. (b) We show the test-time objective we use for solution optimization. (c) We show an example where a mistake in the solution is identified by the test-time objective, and the updated solution then fixes the mistake.

Extended Data Fig. 5 Prompt optimization details.

(a) With TextGrad, we optimize a system prompt for gpt-3.5-turbo using gpt-4o as the gradient engine that provides the feedback during backpropagation. Supplementary Table 1 includes ablations with instruction-only and demonstration-only optimization with TextGrad. (b) We show an example of an optimized instruction for GSM8k. (c) We show an example of optimized in-context demonstrations for GSM8k.
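
As a companion to this caption, the following is a minimal sketch of such a prompt-optimization loop using the public TextGrad package, with gpt-4o as the backward (gradient) engine and gpt-3.5-turbo as the model whose system prompt is tuned. The two-example training set and the exact constructor arguments (for example, passing system_prompt to BlackboxLLM) are illustrative assumptions and may differ across package versions.

```python
import textgrad as tg

# Assumption: API keys for both engines are configured.
forward_engine = tg.get_engine("gpt-3.5-turbo")   # model whose prompt is tuned
tg.set_backward_engine("gpt-4o", override=True)   # gradient (feedback) engine

# The system prompt is the trainable parameter (requires_grad=True).
system_prompt = tg.Variable(
    "You are a helpful assistant. Think step by step and end with the answer.",
    requires_grad=True,
    role_description="system prompt that guides the model's reasoning")
model = tg.BlackboxLLM(forward_engine, system_prompt=system_prompt)
optimizer = tg.TGD(parameters=[system_prompt])

# Hypothetical two-example training set (not data from the paper).
train_set = [
    ("A farmer has 12 cows and buys 9 more. How many cows are there now?", "21"),
    ("What is 17 * 3?", "51"),
]

for question_text, target in train_set:
    question = tg.Variable(question_text, requires_grad=False,
                           role_description="question to the LLM")
    prediction = model(question)
    # Textual loss compares the prediction against the known answer.
    loss_fn = tg.TextLoss(f"The correct answer is {target}. Judge whether the "
                          "response reaches it and critique the reasoning.")
    loss = loss_fn(prediction)
    loss.backward()       # feedback flows back to the system prompt
    optimizer.step()      # the prompt is rewritten using that feedback
    optimizer.zero_grad()

print(system_prompt.value)
```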

Extended Data Fig. 6 Treatment planning details.

(a) We display several dose metrics of the PTV target for all the clinical and TextGrad-optimized plans, including the mean and minimum doses as well as the D95. For all metrics, we include the average deviation from the clinical goal across five plans, with the standard deviation in brackets. Values in bold represent the best result for each PTV target. (b) We show the mean dose to capture OAR sparing. Lower values indicate better sparing of the organs at risk, which is desirable because these organs should not receive more dose than specified in the clinical guidelines. For all metrics, we include the average mean dose across five plans, with the standard deviation in brackets.
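
For readers less familiar with these dose metrics: D95 is the dose received by at least 95% of the planning target volume (PTV), i.e. the 5th percentile of the voxel-wise dose distribution inside the PTV, and the OAR mean dose is the average dose over an organ at risk. A small worked example on hypothetical dose arrays (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical voxel-wise doses in Gy (illustration only, not data from the paper).
ptv_dose = rng.normal(loc=60.0, scale=1.5, size=10_000)   # planning target volume
oar_dose = rng.normal(loc=18.0, scale=4.0, size=10_000)   # an organ at risk

d95 = np.percentile(ptv_dose, 5)          # dose covering 95% of the PTV
ptv_mean, ptv_min = ptv_dose.mean(), ptv_dose.min()
oar_mean = oar_dose.mean()                # lower mean dose = better OAR sparing

print(f"PTV: D95 = {d95:.1f} Gy, mean = {ptv_mean:.1f} Gy, min = {ptv_min:.1f} Gy")
print(f"OAR: mean dose = {oar_mean:.1f} Gy")
```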

Supplementary information

Supplementary Information

This file contains Supplementary Notes 1–10, Figs. 1–4, Tables 1–3 and References.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yuksekgonul, M., Bianchi, F., Boen, J. et al. Optimizing generative AI by backpropagating language model feedback. Nature 639, 609–616 (2025). https://doi.org/10.1038/s41586-025-08661-4

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41586-025-08661-4
