---
title: "Monetary Policy Shocks: A New Hope"
subtitle: "Large Language Models and Central Bank Communication"
author: "Rubén Fernández-Fuertes"
affiliation: "Bocconi University"
status: "Job Market Paper"
date: "May 2026"
paper_url: "https://rubenfernandezfuertes.com/papers/2025/jmp.html"
pdf_url: "https://rubenfernandezfuertes.com/papers/2025/2025__fernandez-fuertes__jmp.pdf"
keywords: "Monetary Policy Shocks, Central Bank Communication, Large Language Models, FOMC, Federal Reserve, Natural Language Processing, High-Frequency Identification, Term Structure"
jel: "E52, E58, E43, G14, C45, C55"
---

# Monetary Policy Shocks: A New Hope
## Large Language Models and Central Bank Communication

**Rubén Fernández-Fuertes** · Bocconi University · Job Market Paper
First version: 12 October 2025 · This version: May 2026

## Abstract

I develop a multi-agent LLM framework that processes Federal Reserve communications to construct narrative monetary policy surprises. By analysing Statements, Minutes, Beige Books, and press conferences released before each FOMC meeting, the system elicits conditional expectations that yield less noisy surprises than market-based measures. These surprises produce theoretically consistent impulse responses, with contractionary shocks generating persistent disinflation, and carry directional information about the policy path that high-frequency announcement surprises miss.

**JEL Classification:** E52, E58, E43, G14, C45, C55

**Keywords:** Monetary Policy Shocks, Central Bank Communication, Large Language Models, FOMC, Federal Reserve, Natural Language Processing, High-Frequency Identification, Term Structure


# Introduction

Measuring monetary policy shocks requires isolating the component of Federal Reserve decisions that is genuinely unpredictable from available information. Identification strategies have evolved through three generations: narrative measures, structural VARs, and high-frequency identification. Each generation addresses limitations of its predecessor while introducing new challenges. High-frequency identification, pioneered by Kuttner (2001), constructs market-based surprises from federal funds futures to separate anticipated from unanticipated policy changes. However, these surprises, measured from high-frequency interest rate movements around FOMC announcements, suffer from contamination by non-policy information (Bauer & Swanson, 2023b; Jarociński & Karadi, 2020; Miranda-Agrippino & Ricco, 2021; Nakamura & Steinsson, 2018). Narrative measures, such as those in Romer & Romer (1989) (henceforth, R&R), are thought to avoid this contamination but are impossible to implement in real time and are limited to binary shock indicators (shock or no shock). The common practice has been to clean market-based surprises *ex post* of information contamination to obtain stronger instruments for monetary policy shock identification.

In this paper, I develop an *ex ante* method to construct monetary policy surprises by eliciting conditional expectations from the Federal Reserve’s public communications released weeks before each FOMC decision. Using a multi-agent LLM system that processes Statements, press conferences, Minutes, and Beige Books in a survey-like manner (querying the model to extract probabilistic beliefs from narrative content), I form complete probability distributions over potential Fed actions based solely on the documentary record available to all market participants. Across 272 FOMC meetings from 1996 to 2026, this nonparametric, nonlinear extraction of conditional expectations from narrative text yields less noisy surprises that explain 46.5% of policy-rate variation, roughly three times the 14.5–16.9% explained by standard market-based measures. The residual variation captures the part of policy decisions that was not predictable from the Fed’s public communications. By measuring surprises against the information set frozen weeks before the meeting, I isolate what was genuinely unexpected from the Fed’s public messaging, while market measures incorporate all information up to announcement moments and may include flows that contaminate identification.

The methodology revives R&R’s identification approach while addressing its fundamental limitations through three advances. First, I employ survey-elicitation from narrative text using a multi-agent LLM system, processing the Fed’s entire pre-meeting documentary corpus at scale to extract conditional expectations nonparametrically. This solves both the practical constraint that makes R&R’s approach infeasible in real time and the functional-form restrictions inherent in parametric linear residual methods. Second, I generate complete probability distributions over potential Fed actions rather than binary shock indicators, capturing information about magnitude and uncertainty that discrete classifications discard. Third, by forming expectations exclusively from documents released before each FOMC meeting’s blackout period, my surprises are predetermined relative to announcement-day information flows, mitigating the contamination issues that affect high-frequency measures. An optional news stage can incorporate inter-meeting financial-press coverage during the pre-FOMC blackout, further reducing the predictability of the surprise from standard pre-meeting controls.

This points to the paper’s central methodological contribution: direct extraction of conditional expectations offers a path forward beyond *ex post* cleaning. While I implement this approach using only public Fed documents, the framework naturally extends to richer information sets: central banks could incorporate internal forecasts, market participants could integrate real-time news flows, and researchers could combine multiple communication channels. This paper establishes the baseline: even the simplest version, processing only public pre-meeting documents, achieves minimal measurement error and properly calibrated expectations. The *New Hope* is that identification quality improves as extraction methods become more sophisticated, not through ever-more-elaborate *ex post* econometric adjustments.

Three sets of findings characterize the resulting surprise: its measurement properties, its macroeconomic transmission, and the directional content it carries beyond standard high-frequency measures. First, the surprises exhibit minimal measurement error: realised rate changes move nearly one-for-one with the model-implied surprise, consistent with approximately Bayesian updating and small contamination leakage. The remaining gaps from the Fed’s full information set are quantifiable rather than hidden: internal Greenbook forecasts add explanatory power beyond my measure, quantifying a private-information wedge between public documents and the Fed’s internal staff forecasts, while the news stage absorbs the public information that arrives during the blackout window. The measure also carries genuine path information: it predicts subsequent rate changes but not subsequent surprises, a pattern consistent with forward-guidance content rather than stale expectations.

Second, impulse-response analysis (Ramey, 2016) reveals transmission patterns that are theoretically coherent across macroeconomic and financial variables. Following a contractionary surprise, price levels decline immediately and persistently, avoiding the “price puzzle” that often plagues market-based identifications; real GDP and industrial production show sustained contractionary effects; and unemployment rises with the expected delay. The consistent contractionary signs across all variables distinguish my measure from market-based alternatives that often produce sign anomalies requiring *ex post* instrumental-variable corrections. The yield-curve response decomposes into two transmission channels: an initial spread compression driven by rising expected short rates, followed by a delayed steepening as the expected path digests the policy cycle. The surprise thus affects both expectations and risk compensation.

Third, the residual carries directional information about the policy path that the four standard announcement-window surprises (FF1, FF4, ED1, ED4) do not linearly span. A span test shows that $`81.5\%`$ of the LLM-surprise variance is orthogonal to those four factors, and a yield-curve local projection localises that orthogonal content to the front of the curve, where forward guidance is priced. An asset-pricing test built from the local-projection sign change (a 1m/2y equal-notional flattener signed by the surprise) delivers per-trade yield-spread returns that survive a battery of placebo, threshold, weighting, maturity-pair, and cross-LLM robustness checks. An orthogonal-residual variant is statistically indistinguishable from the raw signal, so the payoff is not a re-projection of the linear high-frequency span. The exercise is measurement validation, not tradable alpha: the result is concentrated in active rate-cycle episodes and on the dovish side, and the paired LLM–ED4 horizon comparison is statistically weak on the small paired sample, with the short-horizon gap failing to reach significance even before any multiple-testing correction.

The multi-agent LLM framework processes the Federal Reserve’s public communication timeline to form probabilistic expectations before each FOMC decision. Before each meeting, four documents become publicly available in sequence: the previous meeting’s FOMC Statement and Chair’s press conference, the Minutes from that meeting (released several weeks later), and the current Beige Book (released roughly two weeks before the upcoming meeting). This structured communication system motivates a multi-agent pipeline: dedicated decoders quantify the policy-relevant content of each document type, a forecaster synthesises these inputs through sequential probability updates into a distribution over potential Fed actions, and a surprise extractor compares the final pre-meeting prior with the realised decision. An optional news-stage agent fills the pre-FOMC blackout window using inter-meeting financial-press coverage. This architecture mirrors the Fed’s actual communication timeline rather than treating documents in isolation.

Following the sampling-and-voting evidence in Li et al. (2024), and drawing on the broader multi-agent LLM literature surveyed in Tillmann (2025), I employ multiple agents to mitigate inherent variability in individual LLM responses. The resulting continuous probability distributions maintain the conceptual clarity that distinguished R&R’s approach from reduced-form VAR identification while overcoming a critical limitation: early narrative methods relied on binary shock indicators that discarded information about magnitude and uncertainty. By processing only public documents released before the blackout period (when Committee members cease public commentary), my framework produces surprises that are predetermined relative to announcement-day information flows. This timing structure provides three identification advantages: (1) surprises are orthogonal to announcement-day asset-price movements and data releases, eliminating simultaneity bias; (2) institutional alignment with the Committee’s deliberative timeline reduces measurement error; and (3) surprises suffer less information-effect contamination, since they do not conflate policy-stance shifts with news about fundamentals revealed during announcements. Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a> formalises the resulting decomposition: the measured surprise equals a policy innovation relative to the Fed’s internal information, plus a private-information wedge, plus a public non-document wedge, plus extraction error, so the residual gaps are explicit rather than hidden. Validation tests confirm measurement reliability: repeated pipeline runs show economically negligible variability, and out-of-sample comparisons rule out look-ahead bias as a driver of results.

#### Related literature.

My approach addresses an identification impasse that has fragmented monetary policy research since the 1990s. The literature evolved through three waves, each resolving problems from its predecessor while introducing new limitations. First, narrative approaches (Romer & Romer, 2004; Romer & Romer, 1989) measured policy shocks through direct reading of FOMC records, but Leeper (1997) showed these measures retained predictability from past macroeconomic variables and generated price puzzles. Second, structural VARs (Christiano et al., 1999; Sims, 1980) imposed identifying restrictions on dynamic systems, but Sims (1992) documented persistent price puzzles and Stock & Watson (2001) questioned the strong, untestable assumptions about contemporaneous relationships. Third, high-frequency identification (Gertler & Karadi, 2015; Kuttner, 2001) exploited narrow event windows around announcements, but Ramey (2016) demonstrated these instruments remained predictable from Greenbook forecasts and produced different results across estimation methods (VAR versus local projections).

Current research debates whether high-frequency measures suffer from information-effect contamination (Miranda-Agrippino & Ricco, 2021; Nakamura & Steinsson, 2018) (henceforth, M-A&R for (Miranda-Agrippino & Ricco, 2021)) or misspecified reaction functions (Bauer & Swanson, 2023b, 2023a) (henceforth, B&S), with recent evidence from Ricco & Savini (2025) favouring the information-channel interpretation. Regardless of which mechanism prevails, both camps acknowledge that high-frequency measures conflate policy-stance shifts with information revelation, contaminating shock identification. My narrative approach sidesteps this contamination by constructing expectations exclusively from public Fed documents released before the blackout period, ensuring surprises are predetermined relative to announcement-day information flows.

This distinguishes my work from two related strands of textual analysis. First, research extracting sentiment from Fed communications (De Fiore et al., 2024; Gambacorta et al., 2024; A. L. Hansen & Kazinnik, 2023) focuses on characterising tone rather than constructing counterfactual expectations for surprise measurement. A separate strand maps central-bank text to forecast revisions via supervised machine-learning models trained on Greenbook texts (Ahrens et al., 2024; Ahrens & McMahon, 2021); these methods extract quantitative signals from speeches but do not construct meeting-level priors against which to measure surprise. Second, Aruoba & Drechsel (2024) use natural language processing on internal Greenbook documents to control for the Fed’s private information when identifying exogenous policy shocks. Crucially, they analyse private documents (Greenbook forecasts unavailable to markets) to address the exogeneity problem, ensuring shocks are orthogonal to the Fed’s internal information set. I instead construct expectations from public documents (Statements, press conferences, Minutes, and Beige Books released before blackout periods) to address the surprise-measurement problem: what markets should have anticipated from observable Fed communications. This public-private distinction is fundamental: they construct the Fed’s information set to purge endogenous responses; I construct the market’s information set to identify genuinely unpredictable policy shifts from the perspective of real-time observers.

Following the sampling-and-voting evidence in Li et al. (2024), alongside the broader multi-agent literature surveyed in Tillmann (2025), I employ multi-agent LLM systems to mitigate individual response variability while implementing temporal constraints preventing look-ahead bias. The framework systematically processes 272 FOMC meetings’ worth of Statements, press conferences, Minutes, and Beige Books from 1996 to 2026, solving the scale constraint that limited Romer & Romer (1989) to small samples or binary measures. This methodological advance makes the narrative approach implementable at scale while maintaining its conceptual advantage of measuring surprises against the Fed’s actual communication timeline.

The remainder of the paper proceeds as follows. Section <a href="#sec:methodology" data-reference-type="ref" data-reference="sec:methodology">2</a> describes the multi-agent system architecture and the optional news stage. Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a> develops the signal-extraction framework that formalises the wedge decomposition. Section <a href="#sec:data" data-reference-type="ref" data-reference="sec:data">4</a> presents the data sources. Section <a href="#sec:results" data-reference-type="ref" data-reference="sec:results">5</a> validates measurement properties of the narrative surprise. Section <a href="#sec:transmission" data-reference-type="ref" data-reference="sec:transmission">6</a> extends to macroeconomic and financial impulse responses and a yield-curve asset-pricing diagnostic of the surprise’s residual content. Section <a href="#sec:conclusion" data-reference-type="ref" data-reference="sec:conclusion">7</a> discusses implications for monetary economics research and the application of Large Language Models to systematic document processing.

# Methodology

## General Framework

The Federal Reserve’s staggered document release schedule provides a natural sequence for updating expectations over the next FOMC rate decision as each official communication becomes public. For each of the eight *scheduled* FOMC meetings per year, up to four key documents become publicly available in sequence:[^3] the *FOMC Statement* from meeting $`t-1`$, announcing the policy decision and forward guidance language; the *Press Conference* from the same day, consisting of the Chair’s Q&A with journalists;[^4] the *Minutes* released approximately three weeks after meeting $`t-1`$, documenting deliberations and the distribution of views; and the *Beige Book* released approximately two weeks before meeting $`t`$, containing qualitative economic assessments from the 12 Districts.
I report a schematic of the FOMC’s communication releases in Figure <a href="#fig:fomc-timeline" data-reference-type="ref" data-reference="fig:fomc-timeline">1</a>.

<figure id="fig:fomc-timeline" data-latex-placement="t!">

<figcaption><em>Note:</em> Temporal sequence of Federal Reserve communications and the corresponding filtration stages for meeting <span class="math inline"><em>t</em></span>. On meeting day <span class="math inline"><em>t</em> − 1</span>, the Statement and Press Conference produce priors <span class="math inline">𝒫<sub>1</sub></span> and <span class="math inline">𝒫<sub>2</sub></span>. The Minutes<span class="math inline"><sub><em>t</em> − 1</sub></span> (released <span class="math inline">≈</span>3 weeks later) update to <span class="math inline">𝒫<sub>3</sub></span>. The Beige Book<span class="math inline"><sub><em>t</em></sub></span> (released <span class="math inline">≈</span>2 weeks before meeting <span class="math inline"><em>t</em></span>) produces the final prior <span class="math inline">𝒫<sub>4</sub></span>. At meeting <span class="math inline"><em>t</em></span>, the surprise is computed as <span class="math inline"><em>s</em><sub><em>t</em></sub> = <em>Δ</em><em>i</em><sub><em>t</em></sub> − 𝔼[<em>Δ</em><em>i</em><sub><em>t</em></sub> ∣ 𝒫<sub>4</sub>]</span>.</figcaption>
</figure>

This temporal structure motivates a pipeline that mirrors the sequential release of information rather than processing all documents simultaneously. Dedicated *Decoder* agents extract structured signals from each document as it becomes publicly available, and these signals are fed into a *Forecaster* that maintains a probability distribution over the upcoming rate decision. Each document expands the information set, constituting a filtration: at stage $`k`$, the Forecaster produces an updated posterior $`\mathcal{P}_k`$ with expected rate change $`\mathbb{E}[\Delta i_t \mid \mathcal{P}_k]`$. The *Surprise Extractor* then computes the policy innovation $`\hat{s}_t = \Delta i_t - \mathbb{E}[\Delta i_t \mid \mathcal{P}_4]`$ using the final posterior.[^5]
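
The filtration logic can be sketched in a few lines. The sketch below is illustrative only: the discrete grid of rate moves, the flat initial prior, and the toy document-implied likelihoods are assumptions standing in for the LLM Forecaster’s actual probabilistic updates.

```python
# Sketch of the sequential filtration P1 -> P4 and surprise extraction.
# The grid, the flat prior, and the toy likelihoods are illustrative
# assumptions, not outputs of the actual LLM pipeline.

GRID = [-0.50, -0.25, 0.00, 0.25, 0.50]  # candidate rate changes (pp)

def normalise(p):
    z = sum(p.values())
    return {k: v / z for k, v in p.items()}

def update(prior, likelihood):
    # One filtration step: reweight the prior by a document-implied likelihood.
    return normalise({k: prior[k] * likelihood.get(k, 1e-9) for k in prior})

def expected_change(p):
    return sum(k * v for k, v in p.items())

posterior = {k: 1.0 / len(GRID) for k in GRID}  # flat prior before any document

documents = [  # toy likelihoods: Statement, Press Conference, Minutes, Beige Book
    {0.00: 0.5, 0.25: 0.4, 0.50: 0.1},
    {0.00: 0.4, 0.25: 0.5, 0.50: 0.1},
    {0.00: 0.3, 0.25: 0.6, 0.50: 0.1},
    {0.00: 0.2, 0.25: 0.7, 0.50: 0.1},
]
for lik in documents:  # P1 -> P2 -> P3 -> P4
    posterior = update(posterior, lik)

realised = 0.25  # FOMC decision at meeting t
surprise = realised - expected_change(posterior)
```

Each document tilts the distribution toward a 25bp hike, so the final surprise is small and positive: the decision was largely, but not fully, anticipated from the documentary record.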

The rationale for this decomposition is both economic and computational. Each document type exhibits distinct linguistic properties requiring specialized analytical focus: Beige Book narratives demand sentiment extraction from qualitative regional reports, Minutes require parsing of internal Committee dynamics and forward guidance signals, and the Forecaster must synthesize these inputs into a probability distribution over the next rate decision. A single prompt processing all documents simultaneously must juggle multiple conflicting objectives while maintaining temporal consistency and minimizing look-ahead bias.
The pipeline comprises six LLM-facing agents organized in three functional roles: four *Decoders* that extract structured information from each document type, a *Forecaster* that performs sequential probability updating, and a *Surprise Extractor* that computes the monetary policy surprise. Figure <a href="#fig:mas-architecture" data-reference-type="ref" data-reference="fig:mas-architecture">2</a> illustrates the architecture.

<figure id="fig:mas-architecture" data-latex-placement="t!">

<figcaption><em>Note:</em> Sequential filtration pipeline for meeting <span class="math inline"><em>t</em></span>. Four Decoder agents process documents in chronological order of public availability: Statement<span class="math inline"><sub><em>t</em> − 1</sub></span> and Press Conference<span class="math inline"><sub><em>t</em> − 1</sub></span> (released on meeting day <span class="math inline"><em>t</em> − 1</span>), Minutes<span class="math inline"><sub><em>t</em> − 1</sub></span> (released <span class="math inline">≈</span>3 weeks later), and Beige Book<span class="math inline"><sub><em>t</em></sub></span> (released <span class="math inline">≈</span>2 weeks before meeting <span class="math inline"><em>t</em></span>). Each Decoder feeds the Forecaster, which updates the prior distribution: <span class="math inline">𝒫<sub>1</sub> → 𝒫<sub>2</sub> → 𝒫<sub>3</sub> → 𝒫<sub>4</sub></span>. The final prior <span class="math inline">𝒫<sub>4</sub></span> incorporates all pre-meeting public information. The Surprise Extractor compares <span class="math inline">𝔼[<em>Δ</em><em>i</em><sub><em>t</em></sub> ∣ 𝒫<sub>4</sub>]</span> with the realized decision from Statement<span class="math inline"><sub><em>t</em></sub></span> to compute the monetary policy surprise.</figcaption>
</figure>

For each meeting $`t`$, the four Decoders process the documents in the chronological order shown in Figure <a href="#fig:mas-architecture" data-reference-type="ref" data-reference="fig:mas-architecture">2</a>, the Forecaster updates $`\mathcal{P}_1 \to \mathcal{P}_4`$ at each step, and the Surprise Extractor compares $`\mathbb{E}[\Delta i_t \mid \mathcal{P}_4]`$ with the realized FOMC decision. Each component is detailed in turn below.

Existing textual analysis methods face a tension between scale and context when processing this documentary architecture. Supervised machine-learning approaches that map central-bank text to forecasts (Ahrens et al., 2024; Ahrens & McMahon, 2021; Aruoba & Drechsel, 2024), alongside single-model LLM analyses (De Fiore et al., 2024; Gambacorta et al., 2024; A. L. Hansen & Kazinnik, 2023), improve semantic comprehension over keyword-frequency methods. But they typically reduce a document to a single quantitative summary and struggle to maintain coherent reasoning across the full documentary timeline while respecting the distinct analytical requirements of each document type. The sequential architecture addresses this by assigning each document to a specialized decoder, while the Forecaster handles the synthesis task of updating beliefs as new information arrives.

Applying an LLM to historical documents raises a specific concern: extraction may reflect what the model absorbed during training rather than what the document says, in which case pre-cutoff results need not generalise. The pipeline uses DeepSeek-v3.1 (671B), whose July 2024 knowledge cutoff partitions the sample into in-training and out-of-training meetings and supplies a built-in falsification test, since a memorisation-driven pipeline would deteriorate on post-cutoff meetings. Architectural safeguards complement that comparison: strict document-level temporal cutoffs, prompt-level time anchors, sequential release ordering, and automated future-reference checks. The pipeline has been re-run on five other LLM families with distinct training cutoffs, and the results hold (Appendices <a href="#appendix:look-ahead-bias-prevention" data-reference-type="ref" data-reference="appendix:look-ahead-bias-prevention">9.1.1</a>, <a href="#subsec:trading_cross_model" data-reference-type="ref" data-reference="subsec:trading_cross_model">[subsec:trading_cross_model]</a>, and <a href="#subsec:text_asymmetry_multimodel" data-reference-type="ref" data-reference="subsec:text_asymmetry_multimodel">[subsec:text_asymmetry_multimodel]</a>). Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a> formalises the leakage parameter $`\ell_j`$ and Section <a href="#sec:validation" data-reference-type="ref" data-reference="sec:validation">2.9</a> reports the diagnostics.
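
The document-level safeguards can be illustrated with a minimal sketch: a hard release-date cutoff plus a coarse automated scan for references to years after the document’s own release. The function names and the year-based regex are illustrative assumptions, not the paper’s exact implementation (legitimate forward-looking language will also be flagged, so hits would be reviewed rather than dropped automatically).

```python
# Sketch of document-level temporal safeguards: a release-date cutoff
# plus a coarse check for four-digit years after the release date.
# Names and the regex are illustrative, not the actual implementation.
import re
from datetime import date

def passes_temporal_cutoff(release: date, meeting: date) -> bool:
    """A document enters the filtration only if released before the meeting."""
    return release < meeting

def future_references(text: str, release: date) -> list:
    """Flag four-digit years later than the release year (a coarse leak check)."""
    years = {int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)}
    return sorted(y for y in years if y > release.year)

doc = "The Committee expects growth to moderate in 2001."
ok = passes_temporal_cutoff(date(2000, 11, 15), date(2000, 12, 19))
flags = future_references(doc, date(2000, 11, 15))  # [2001]
```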

## Statement Decoder

The Statement Decoder processes the FOMC Statement released on each meeting day. The Statement is the Fed’s most concise and carefully worded communication: typically 300–600 words that announce the rate decision, characterize economic conditions, and signal the likely future policy path. Because the Statement is crafted by consensus, every word change between consecutive meetings carries informational content.

The decoder extracts three categories of information. First, the *rate decision*: the target federal funds rate or range, and the change from the previous meeting. Second, *language changes*: a systematic comparison of the current Statement against the previous one, identifying additions, deletions, and modifications in key phrases, particularly those related to the economic outlook, risk assessment, and forward guidance. Third, *commitment signals*: the strength and type of forward guidance embedded in the Statement, classified along the J. R. Campbell et al. (2012) Delphic–Odyssean spectrum through linguistic proxies (outlook-based language such as “expects” and “likely” versus commitment-based language such as “until” and “at least”).
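
The commitment-signal classification can be sketched with the linguistic proxies named above. The sketch uses simple keyword counts purely for illustration: the word lists and the `StatementSignal` container are assumptions, and the actual decoder elicits these signals from an LLM rather than counting keywords.

```python
# Minimal sketch of the Delphic-Odyssean proxy: counting outlook-based
# versus commitment-based phrases. Word lists and the dataclass are
# illustrative assumptions; the real decoder is LLM-based.
from dataclasses import dataclass

DELPHIC = ("expects", "likely", "anticipates")       # outlook-based language
ODYSSEAN = ("until", "at least", "committed to")     # commitment-based language

@dataclass
class StatementSignal:
    rate_change: float   # decision in percentage points
    delphic_hits: int
    odyssean_hits: int

def decode_statement(text: str, rate_change: float) -> StatementSignal:
    t = text.lower()
    return StatementSignal(
        rate_change=rate_change,
        delphic_hits=sum(t.count(w) for w in DELPHIC),
        odyssean_hits=sum(t.count(w) for w in ODYSSEAN),
    )

stmt = ("The Committee expects that economic conditions will warrant "
        "keeping rates low until the recovery is firmly established.")
sig = decode_statement(stmt, 0.0)
```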

These extractions feed directly into the Forecaster’s first update stage, forming the initial prior $`\mathcal{P}_1`$ for the next meeting’s decision.

## Press Conference Decoder

The Press Conference Decoder analyzes the Chair’s Q&A session held shortly after the Statement release. While the Statement represents the Committee’s consensus language, the press conference reveals the Chair’s individual interpretation, emphasis, and willingness to go beyond the prepared text. Journalists’ questions often probe ambiguities in the Statement, and the Chair’s responses can either reinforce or subtly diverge from the written message.

The decoder focuses on three dimensions. First, *divergence from Statement*: instances where the Chair’s language is more hawkish, more dovish, or more uncertain than the Statement text, signaling the Chair’s private assessment. Second, *commitment reinforcement*: whether the Chair strengthens or hedges the Statement’s forward guidance when pressed by journalists. Third, *notable responses*: answers that reveal information about the Committee’s reaction function, risk assessment, or internal disagreements not captured in the consensus Statement.

The decoder is calibrated to distinguish genuine signals from institutional boilerplate: during a tightening cycle, reaffirming the Committee’s commitment to price stability is expected language that carries no incremental information, while explicit signals of *additional* tightening beyond the current stance represent meaningful hawkish shifts. These signals update the prior from $`\mathcal{P}_1`$ to $`\mathcal{P}_2`$, capturing information that would be missed by analyzing the Statement alone.

## Minutes Decoder

The Minutes Decoder operates on the Minutes released three weeks after meeting $`t-1`$, the most detailed publicly available window into Committee deliberations before meeting $`t`$ and a deliberate communication tool that shapes market expectations between meetings.

The decoder extracts *policy intelligence* (internal debates, forward guidance, and risk assessments) absent from the same-day Statement. The core outputs include: (i) the distribution of hawkish and dovish views within the Committee, including the intensity of disagreement and compromise reasoning; (ii) forward guidance signals classified along the J. R. Campbell et al. (2012) taxonomy as Delphic (outlook-based, e.g. “expects”, “likely”) or Odyssean (commitment-based, e.g. “until”, “at least”), each reported as a continuous intensity score on $`[0,1]`$;[^6] (iii) new policy-relevant information surfaced in Minutes but not in the same-day Statement; and (iv) the Committee’s own assessment of risks and uncertainties facing the economy.

FOMC Minutes follow a consistent deliberative structure that the decoder exploits (Figure <a href="#fig:minutes-sections" data-reference-type="ref" data-reference="fig:minutes-sections">3</a>). A segmenter groups the document into three strategic units, *Staff* (economic and financial reviews plus outlook), *Committee* (participants’ discussion of conditions and outlook), and *Action* (policy decision and dissents), separating the informational backdrop from internal Committee dynamics and the decision rationale so each can be extracted with targeted prompts. Each unit is processed independently and the results are merged; when the full document fits within the model’s context window, a single-pass extraction is used instead.[^7]

<figure id="fig:minutes-sections" data-latex-placement="t!">

<figcaption><em>Note:</em> When the Minutes exceed the model’s context window, the decoder exploits the document’s deliberative structure via section-based processing. A segmenter splits the Minutes into five substantive sections, grouped into three strategic units: Staff Reviews (economic and financial reviews plus outlook), Participants’ Views (committee discussion), and Policy Action (decision rationale and dissents). Each group is processed independently, and partial results are merged programmatically. When the full document fits within the context window, a single-pass extraction is used instead.</figcaption>
</figure>
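
The grouping of Minutes sections into the three strategic units can be sketched as a simple mapping from headings to units. The heading strings below are illustrative: actual headings vary across the sample, and the real segmenter is more robust to that variation.

```python
# Sketch of the Minutes segmenter: grouping standard sections into the
# three strategic units (Staff, Committee, Action). Heading strings are
# illustrative; the actual segmenter handles historical variation.
SECTION_TO_UNIT = {
    "Staff Review of the Economic Situation": "Staff",
    "Staff Review of the Financial Situation": "Staff",
    "Staff Economic Outlook": "Staff",
    "Participants' Views on Current Conditions and the Economic Outlook": "Committee",
    "Committee Policy Action": "Action",
}

def segment_minutes(sections: dict) -> dict:
    """Group {heading: text} into {unit: concatenated text}."""
    units = {"Staff": [], "Committee": [], "Action": []}
    for heading, text in sections.items():
        unit = SECTION_TO_UNIT.get(heading)
        if unit:
            units[unit].append(text)
    return {u: "\n\n".join(parts) for u, parts in units.items()}

toy = {
    "Staff Review of the Economic Situation": "Staff saw growth moderating.",
    "Participants' Views on Current Conditions and the Economic Outlook":
        "Most participants judged risks as balanced.",
    "Committee Policy Action": "The Committee decided to maintain the target range.",
}
units = segment_minutes(toy)
```

Each unit is then passed to its own targeted prompt; when the full document fits in the context window, this segmentation step is skipped.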

These extractions update the prior from $`\mathcal{P}_2`$ to $`\mathcal{P}_3`$. Table <a href="#tab:minutes_validation_main" data-reference-type="ref" data-reference="tab:minutes_validation_main">[tab:minutes_validation_main]</a> presents two validation tests. The internal check (column 1) regresses the magnitude of the $`\mathcal{P}_1 \to \mathcal{P}_3`$ revision on the prior Minutes’ guidance type:[^8]
``` math
\begin{equation}
    |\mathrm{E}[\Delta i_t \mid \mathcal{P}_3] - \mathrm{E}[\Delta i_t \mid \mathcal{P}_1]| = \alpha + \beta\,\textit{odyssean}_{t-1} + \gamma\,\textit{delphic}_{t-1} + \delta\,\textit{dissents}_{t-1} + \phi\,\mathrm{E}[\Delta i_t \mid \mathcal{P}_1] + \varepsilon_t.
\end{equation}
```
A negative $`\beta`$ on odyssean guidance would indicate that stronger commitment-based language in the prior Minutes had already anchored markets, leaving less work for the Minutes-stage update; a positive $`\gamma`$ on delphic guidance would indicate that outlook-based language, by contrast, moves expectations because it conveys data-dependent information that markets had not already priced. Estimation confirms the first reading: odyssean guidance enters at $`-0.014`$, significant at the 1% level, while delphic guidance and the dissent count are not statistically significant.

The external checks (columns 2–3) regress the realized FOMC decision $`\Delta i_t`$, which is not a pipeline output, on the hawkishness signal:
``` math
\begin{equation}
    \Delta i_t = \alpha + \beta\,\textit{hawk-dove}_{t-1} + \gamma\,\textit{debate}_{t-1} + \delta\,\textit{dissents}_{t-1} + \phi\,\mathrm{E}[\Delta i_t \mid \mathcal{P}_1] + \varepsilon_t,
\end{equation}
```
with column (2) estimating a parsimonious version that drops $`\textit{debate}_{t-1}`$ and $`\mathrm{E}[\Delta i_t \mid \mathcal{P}_1]`$ and column (3) estimating the full specification. The parsimonious specification (column 2: hawk-dove and dissents) explains 36% of the variation in realized decisions, with hawk-dove balance loading at $`+0.293`$, significant at the 1% level. Conditioning on $`\mathrm{E}[\Delta i_t \mid \mathcal{P}_1]`$ in column (3) shows that hawk-dove balance still carries incremental predictive content beyond what the prior Statement already signalled (coefficient $`+0.074`$, significant at the 1% level), while the prior itself loads at $`+0.814`$, also significant at the 1% level. Debate intensity enters negatively at the 10% level ($`-0.199`$) in this specification; the dissent count remains insignificant. Together, the two specifications give a coherent picture: the *type* of guidance predicts how much the prior moves (column 1), and the *direction* of hawkishness predicts the realised decision both unconditionally and incrementally to the prior Statement (columns 2–3).

## Beige Book Decoder

The Beige Book Decoder operates on the Beige Book released approximately two weeks before each FOMC meeting, converting qualitative economic narratives from the twelve Federal Reserve districts into structured, policy-relevant signals that feed the final prior update $`\mathcal{P}_3 \to \mathcal{P}_4`$.

The decoder exploits the Beige Book’s geographic structure (Figure <a href="#fig:bb-regional" data-reference-type="ref" data-reference="fig:bb-regional">4</a>). A first-tier parser splits the raw document along its twelve Federal Reserve district sections, and each section is processed independently and in parallel by a district-level decoder that produces a structured intelligence report: dual mandate assessments for inflation and employment, any auxiliary topics the district emphasizes, a communication analysis of tone and framing, and identified causal mechanisms linking local conditions to policy implications. A second-tier national orchestrator then synthesizes the twelve district reports into aggregate national scores, weighting districts by approximate GDP shares and by confidence levels. The orchestrator does not simply average: it identifies geographic patterns, flags bellwether districts (Cleveland and Chicago for manufacturing, New York for financial conditions, Dallas for energy), and notes disagreements across regions. This design exploits the document’s inherent regional structure rather than resorting to arbitrary text chunking, ensuring that each district’s narrative is analyzed in full context. Appendix <a href="#subsec:district_identification" data-reference-type="ref" data-reference="subsec:district_identification">9.3.2</a> examines whether this geographic decomposition carries independent identification content beyond the national aggregate.

<figure id="fig:bb-regional" data-latex-placement="t!">

<figcaption><em>Note:</em> The Beige Book Decoder exploits the document’s geographic structure. A parser splits the raw text into twelve Federal Reserve district sections. Each district is processed independently in parallel, producing district-level dual mandate and auxiliary topic scores. The National Orchestrator synthesizes district assessments into aggregate scores (<span class="math inline"><em>b</em><sub>agg</sub></span>, <span class="math inline"><em>b</em><sub>inf</sub></span>, <span class="math inline"><em>b</em><sub>emp</sub></span>) using Equation <a href="#eq:bb-aggregate" data-reference-type="ref" data-reference="eq:bb-aggregate">[eq:bb-aggregate]</a>.</figcaption>
</figure>

Scoring follows a *narrative-first* protocol. District decoders write a qualitative assessment of each economic signal and then assign a policy probability simplex $`\boldsymbol{\pi} = (p_{\text{tighten}}, p_{\text{neutral}}, p_{\text{ease}})`$, reflecting the probability that the signal, considered in isolation, would push the FOMC toward tightening, holding, or easing. The net policy bias is then computed deterministically as $`b = p_{\text{tighten}} - p_{\text{ease}} \in [-1, 1]`$, with the sign indicating direction and the magnitude reflecting strength. This factorization separates qualitative reasoning from quantification, preventing the false precision that arises when LLMs are asked to output continuous scores directly.
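The deterministic quantification step can be sketched as follows; this is an illustrative reconstruction of the $`b = p_{\text{tighten}} - p_{\text{ease}}`$ mapping, not the paper's code, and the function name is hypothetical:

```python
# Sketch (not the paper's implementation) of the deterministic net-bias step
# in the narrative-first protocol: the LLM outputs a policy probability
# simplex; the bias b = p_tighten - p_ease is then computed in code,
# never by the model itself.

def net_policy_bias(p_tighten: float, p_neutral: float, p_ease: float) -> float:
    """Map a policy probability simplex to a net bias in [-1, 1]."""
    total = p_tighten + p_neutral + p_ease
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"probabilities must sum to 1, got {total}")
    return p_tighten - p_ease

# A district signal leaning strongly toward tightening:
b = net_policy_bias(0.72, 0.20, 0.08)  # b = 0.64, hawkish-leaning
```

Separating the qualitative assessment (the simplex) from the scalar score keeps the quantification auditable: the sign and magnitude of $`b`$ follow mechanically from the stated probabilities.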

The decoder produces two score categories. The *dual mandate* scores target the Fed’s statutory objectives: an inflation score $`b_{\text{inf}} \in [-1, 1]`$ and an employment score $`b_{\text{emp}} \in [-1, 1]`$, with positive values indicating hawkish-leaning conditions. *Auxiliary topics* discovered from the text (consumer demand, housing, manufacturing, credit conditions, energy, and others) are scored on the same scale; the topic set varies across meetings. The orchestrator also produces a geographic synthesis, an integrated causal narrative, a risk balance, and conditional triggers (“if labour softens materially, bias shifts to hold”), all of which enter the Forecaster’s context.

These components combine into an aggregate sentiment score
``` math
\begin{equation}
    b_{\text{agg}} = w_{\text{inf}} \cdot b_{\text{inf}} + w_{\text{emp}} \cdot b_{\text{emp}} + \sum_{j=1}^{J} w_j \cdot b_j,
    \label{eq:bb-aggregate}
\end{equation}
```
where $`\{b_j\}_{j=1}^{J}`$ are the auxiliary topic scores and the weights satisfy $`w_{\text{inf}} + w_{\text{emp}} + \sum_j w_j = 1`$. The weights are not fixed: the decoder assigns them at each meeting based on the relative policy relevance each dimension receives in the Beige Book discussion, with an explicit justification for the allocation. Inflation and employment jointly carry 85–95% of the weight throughout the sample, but the inflation share rises sharply during 2021–2022 and falls during the post-2008 disinflationary period; Appendix <a href="#subsec:bb_weights_aux" data-reference-type="ref" data-reference="subsec:bb_weights_aux">9.3.1</a> reports the full weight dynamics and the auxiliary topic scores.
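A worked instance of Equation <a href="#eq:bb-aggregate" data-reference-type="ref" data-reference="eq:bb-aggregate">[eq:bb-aggregate]</a>, with illustrative weights and scores (the actual weights are assigned by the decoder at each meeting):

```python
# Sketch of the weighted aggregation in Equation (eq:bb-aggregate).
# Weights and scores below are hypothetical, chosen to mimic a 2022-style
# meeting where inflation dominates the mandate weights.

def aggregate_score(b_inf, b_emp, w_inf, w_emp, aux):
    """b_agg = w_inf*b_inf + w_emp*b_emp + sum_j w_j*b_j.

    `aux` maps auxiliary topic -> (weight, score)."""
    weights = [w_inf, w_emp] + [w for w, _ in aux.values()]
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return w_inf * b_inf + w_emp * b_emp + sum(w * b for w, b in aux.values())

b_agg = aggregate_score(
    b_inf=0.8, b_emp=0.4, w_inf=0.60, w_emp=0.30,
    aux={"housing": (0.10, -0.5)},  # auxiliary topic: cooling demand
)
# 0.60*0.8 + 0.30*0.4 + 0.10*(-0.5) = 0.55
```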

<figure id="fig:beige-book-scores" data-latex-placement="t!">
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/scores_time_series_light.pdf" />
<figcaption><em>Note:</em> Inflation score (red), employment score (blue), and weighted aggregate score (black, computed via Equation <a href="#eq:bb-aggregate" data-reference-type="ref" data-reference="eq:bb-aggregate">[eq:bb-aggregate]</a>) extracted from Beige Book text. Scores are computed as <span class="math inline"><em>p</em><sub>tighten</sub> − <em>p</em><sub>ease</sub></span> from district-level policy probability simplexes, yielding continuous values in <span class="math inline">[−1, 1]</span> where positive values indicate hawkish-leaning conditions. The aggregate additionally incorporates auxiliary topics (Figure <a href="#fig:auxiliary-scores" data-reference-type="ref" data-reference="fig:auxiliary-scores">[fig:auxiliary-scores]</a>). Recession periods are shaded in gray.</figcaption>
</figure>

Figure <a href="#fig:beige-book-scores" data-reference-type="ref" data-reference="fig:beige-book-scores">5</a> shows that the dual mandate scores track business cycle dynamics. Employment drops sharply during the 2008–2009 recession and remains depressed through 2010, while inflation stays near zero through the late 1990s and early 2000s before rising strongly during the 2022 inflationary episode. The aggregate score declines during recessions and rises during expansions, with the gap to the individual mandate scores reflecting the auxiliary topics’ contribution.

Appendix <a href="#sec:beige-book-appendix" data-reference-type="ref" data-reference="sec:beige-book-appendix">9.3</a> examines the implications of the regional architecture. Geographic disagreement across districts, FOMC voting rotation, and heterogeneous district slopes introduce no exploitable bias (Sections <a href="#subsec:district_identification" data-reference-type="ref" data-reference="subsec:district_identification">9.3.2</a>–<a href="#subsec:big_district_gap" data-reference-type="ref" data-reference="subsec:big_district_gap">9.3.4</a>), validating the GDP-weighted aggregation. The mandate weights themselves, however, are state-dependent: they shift systematically with the macro regime (Section <a href="#subsec:state_dependent_mandates" data-reference-type="ref" data-reference="subsec:state_dependent_mandates">9.3.5</a>).

I now assess whether the decoder produces economically sensible scores by estimating
``` math
\begin{equation}
    \Delta i_t = \alpha + \rho \, x_{t-1} + \gamma \, d_{\text{BB},t} + \boldsymbol{\beta}' \mathbf{b}_t + \varepsilon_t
\end{equation}
```
where $`x_{t-1}`$ is either the lagged rate change $`\Delta i_{t-1}`$ or the lagged rate level $`i_{t-1}`$, $`d_{\text{BB},t}`$ is an indicator equal to one when a dedicated Beige Book is available for meeting $`t`$,[^9] and $`\mathbf{b}_t`$ expands progressively from the weighted aggregate $`\text{BB}_t`$ to the mandate components $`(\text{BB}_t^{\text{infl}}, \text{BB}_t^{\text{empl}})`$ and their interaction. Beige Book scores are available approximately two weeks before each meeting, so this is a predictive regression.

Panel A conditions on the rate level. Without inertia to absorb persistence, the aggregate alone explains 19.0% of variance, and a suppressor effect emerges: the rate level is insignificant unconditionally but enters negatively once Beige Book scores are included, because the scores absorb the cyclical component and reveal the underlying tendency for the Fed to ease from elevated levels.

Panel B conditions on the lagged rate change, capturing the empirical persistence of rate decisions (Rudebusch, 2002). Momentum alone explains 37.2% of variance; the Beige Book aggregate remains highly significant after absorbing this control. Entering the two mandate components separately, neither is individually significant because the subindices are highly collinear ($`\mathrm{corr}(b_{\text{inf}}, b_{\text{emp}}) = 0.79`$, with 88% of meetings on the NE–SW diagonal; see Appendix <a href="#subsec:dual_mandate_plane" data-reference-type="ref" data-reference="subsec:dual_mandate_plane">9.2.4.1</a>). Column (5) re-adds them jointly: because the terms enter additively, the signal compounds when inflation and employment agree and partially cancels when they conflict, with inflation dominating in Panel B and the two carrying roughly equal weight in Panel A.

## Forecaster

The Forecaster synthesizes Decoder outputs through four sequential probability updates that mirror, but do not literally implement, Bayesian updating (see Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a> for the as-if framing). At stage $`k = 1`$, it forms an initial distribution $`\mathcal{P}_1`$ from the Statement Decoder. At stages $`k \in \{2,3,4\}`$, it takes the previous posterior $`\mathcal{P}_{k-1}`$, the structured output of the stage-$`k`$ Decoder, and a *regime narrative* summarizing the recent policy cycle (tightening, easing, or holding), and returns an updated posterior $`\mathcal{P}_k`$. The previous meeting’s realized outcome and a narrative interpretation generated by the Surprise Extractor (Section <a href="#sec:surprise-extractor" data-reference-type="ref" data-reference="sec:surprise-extractor">2.8</a>) are passed into $`\mathcal{P}_1`$ as cross-meeting context, giving the pipeline memory across meetings.

Each $`\mathcal{P}_k`$ is a probability mass function over discretized changes in the federal funds rate target, covering the direction and magnitude of the next decision.[^10] Prompting is *narrative-first*: before assigning probabilities, the model must explain how the new document should shift beliefs relative to $`\mathcal{P}_{k-1}`$ in light of the prevailing regime. This ordering reduces anchoring on the inherited numeric prior and ensures that each update reflects the incremental information in the new document.
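A minimal sketch of the stage posterior object implied by this description; the class and field names are hypothetical (the paper does not specify its data structures), but the ordering constraint, narrative before probabilities, is the one described above:

```python
# Hypothetical representation of one filtration stage's output: a pmf over
# discretized rate changes, preceded by the narrative the model must write
# before assigning probabilities (the narrative-first protocol).
from dataclasses import dataclass

@dataclass
class StagePosterior:
    narrative: str            # written first, before any probabilities
    pmf: dict                 # rate change in bp -> probability

    def validate(self):
        """The pmf must be a proper distribution over the support."""
        assert abs(sum(self.pmf.values()) - 1.0) < 1e-6

# A P2-style update after a press conference rules out a larger move:
p2 = StagePosterior(
    narrative="Press conference rules out +75; mass shifts to +50.",
    pmf={25: 0.05, 50: 0.90, 75: 0.05},
)
p2.validate()
```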

Figure <a href="#fig:filtration_dist" data-reference-type="ref" data-reference="fig:filtration_dist">[fig:filtration_dist]</a> illustrates the filtration on two meetings spanning the LLM training horizon. Panel A shows the June 2022 FOMC, in the early stages of the post-pandemic tightening cycle and squarely within DeepSeek-V3.1’s training window. The Statement Decoder reads the May 2022 communications as committing the Committee to half-point hikes, producing a prior split between +50 and +75 bp ($`\mathcal{P}_1`$: $`E = {+57.5}`$ bp). The Press Conference Decoder then registers Chair Powell’s explicit signal that “a 75 basis point increase is not something the Committee is actively considering,” and the +75 mass collapses to near zero ($`\mathcal{P}_2`$: $`E = {+47.5}`$ bp). The Minutes confirm a Committee aligned around +50 ($`\mathcal{P}_3`$: $`E = {+47.5}`$ bp); the Beige Book leaves the central tendency essentially unchanged ($`\mathcal{P}_4`$: $`E = {+49.0}`$ bp), against a realized 75 bp hike. The residual surprise is $`+26`$ bp, the largest hawkish surprise of the cycle: a documented commitment was overturned in the two days before the meeting following a Wall Street Journal report indicating that policymakers would consider a larger move, an information channel that lies outside the four official documents the pipeline reads.

Panel B shows the September 2024 FOMC, post-cutoff and therefore out of sample for the LLM. The Statement Decoder reads a baseline that still leans toward a hold ($`\mathcal{P}_1`$: $`E = {-2.5}`$ bp); each subsequent document shifts probability mass monotonically toward a 25 bp cut, with the Press Conference, Minutes, and Beige Book each adding roughly ten percentage points to the easing bin ($`\mathcal{P}_2`$: $`E = {-6.25}`$ bp; $`\mathcal{P}_3`$: $`E = {-7.5}`$ bp; $`\mathcal{P}_4`$: $`E = {-8.75}`$ bp), against a realized 50 bp cut. The realization falls outside the prior’s support, and the residual surprise is $`-41.25`$ bp. The two examples make the same point in two regimes: each document moves the distribution in the direction of the realization, and what remains at $`\mathcal{P}_4`$ is the part of the decision the documents do not foreshadow. The out-of-sample case is hard to reconcile with a pure-memorization explanation: the LLM cannot have seen the September 2024 outcome at training time, yet the filtration walks toward it stage by stage rather than jumping to it.

Across the full sample, the average marginal update is largest at the Beige Book stage, reflecting fresh real-economy information not present in the policy-focused earlier documents, and smallest at the Press Conference stage; cumulative updates grow with the span, with $`\mathcal{P}_1 \to \mathcal{P}_4`$ capturing total pre-meeting information content. Appendix <a href="#sec:appendix_distributional" data-reference-type="ref" data-reference="sec:appendix_distributional">10.1</a> reports the full distribution of update magnitudes across all stage pairs.

<figure data-latex-placement="t!">
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/filtration_dist_2022-06-15_light.pdf" />
<figcaption>A. In-sample: June 2022 (<span class="math inline">+75</span> bp).</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/filtration_dist_2024-09-18_light.pdf" />
<figcaption>B. Out-of-sample: September 2024 (<span class="math inline">−50</span> bp).</figcaption>
</figure>
<figcaption><em>Note:</em> Each strip shows the probability mass function over rate-change outcomes at one filtration stage, from <span class="math inline">𝒫<sub>1</sub></span> (Statement) to <span class="math inline">𝒫<sub>4</sub></span> (Beige Book). Colors indicate policy direction (red: cut, grey: hold, blue: hike); the dashed line marks the realized decision. Panel A is in-sample for DeepSeek-V3.1; Panel B is post-cutoff. Residual surprises at <span class="math inline">𝒫<sub>4</sub></span> are <span class="math inline">+26</span> bp (a WSJ leak two days before the meeting overturned a documented commitment, see text) and <span class="math inline">−41.25</span> bp (the realized 50 bp cut lies outside the prior’s support).</figcaption>
</figure>

Box <a href="#box:llm-trace" data-reference-type="ref" data-reference="box:llm-trace">[box:llm-trace]</a> traces verbatim LLM outputs at the Beige Book stage of the July 2022 FOMC: the Decoder’s synthesis enters the Forecaster’s context window unedited, alongside $`\mathcal{P}_3`$ and the regime description.

<div class="textbox">

<div class="tcolorbox">

<span style="color: llmhdrfg"><span class="smallcaps">Prior entering stage</span>
($`\mathcal{P}_3`$, after Minutes):</span>
$`\{{+50}\,\text{bp}:40\%,\ {+75}\,\text{bp}:60\%\}`$
$`E[\Delta i_t \mid \mathcal{P}_3] = +65.0\,\text{bp}`$

*Beige Book Decoder, verbatim output:*

<div class="tcolorbox">

The national economic picture reveals persistent inflationary pressures
coexisting with emerging demand destruction in interest-sensitive sectors.
\[…\] Inflation remains broad-based with particular strength in
services, energy, and food, driven by ongoing cost pressures and wage
growth well above historical averages. \[…\] The clear transmission of
monetary policy is evident through housing market cooling and
manufacturing softening, yet these demand-side adjustments have not yet
translated into meaningful labor market rebalancing or core services
inflation moderation.

</div>

*Forecaster ($`\mathcal{P}_3\!\to\!\mathcal{P}_4`$), verbatim reasoning:*

<div class="tcolorbox">

The Beige Book aligns with $`\mathcal{P}_3`$’s hawkish stance: inflation
pressure (72% tighten) and labor tightness (68% tighten) remain high,
supporting aggressive hikes. However, stronger-than-expected secondary
softening in housing (66% ease) and consumer demand (55% ease) slightly
reduces the probability of a 75 bp hike in favor of 50 bp, as these
effects may temper urgency without undermining the primary mandate focus.

</div>

<span style="color: llmhdrfg"><span class="smallcaps">Updated distribution</span> ($`\mathcal{P}_4`$):</span>
$`\{{+50}\,\text{bp}:55\%,\ {+75}\,\text{bp}:45\%\}`$
$`E[\Delta i_t \mid \mathcal{P}_4] = +61.25\,\text{bp}`$

<span style="color: llmhdrfg"><span class="smallcaps">Realized:</span></span>
$`+75\,\text{bp}`$
($`\checkmark`$ direction correct; $`|s_t| = 13.75\,\text{bp}`$ genuine surprise)

</div>

</div>

## When the Fed Doesn’t Speak: News in the Pre-FOMC Blackout

Once the Beige Book is released, the Fed enters its pre-FOMC media blackout: no new Statement, press conference, Minutes, or Beige Book reaches the public until the meeting. The window lasts roughly 10–14 days. The baseline pipeline therefore stops at $`\mathcal{P}_4`$. An optional fifth stage, $`\mathcal{P}_5`$, fills the blackout by expanding the information set to include inter-meeting news arriving in the same window, drawing on the financial press to capture what the Fed itself has stopped saying.

News articles are sourced from FactSet StreetAccount (see Section <a href="#sec:data" data-reference-type="ref" data-reference="sec:data">4</a>) and filtered for macro-relevant content. Five specialized extractors process the articles by domain in parallel: *data releases* (hard economic indicators with actual, consensus, prior, and surprise direction); *Fed communications* (speeches and testimony by FOMC members, with speaker, voting status, key message, and deviation from the most recent Statement); *macroeconomic outlook* (press narratives on growth, inflation, and labor dynamics, tagged by direction and confidence); *financial conditions* (credit spreads, bank stress, energy shocks, and geopolitical disruptions, rated by systemic severity); and *market expectations* (implied probabilities from futures markets and named-source forecasts from dealer banks). Across post-2004 meetings, the pipeline extracts a median of 20 data-release records per window, and 76 meetings include verbatim quotes from FOMC officials.

A synthesizer consolidates the five domain reports into a single coherent assessment, identifying the dominant signal, conflicts across domains, and the net policy implication. The synthesizer weights channels rather than averaging them, so a single high-severity signal can override consensus across the other four. The Forecaster then updates $`\mathcal{P}_4 \to \mathcal{P}_5`$ using the same narrative-first protocol as the earlier stages. When $`\mathcal{P}_5`$ is enabled, the Surprise Extractor uses it as the reference distribution in place of $`\mathcal{P}_4`$.

The March 2020 emergency cut illustrates the value of channel weighting. Strong pre-virus data (NFP +225K above consensus, ISM manufacturing in expansion) were classified by the synthesizer as backward-looking and overridden by the financial conditions channel: coronavirus systemic severity, Treasury yields at record lows, oil prices down 16%. The Forecaster delivered $`\mathcal{P}_5`$ with $`\mathbb{E}[\Delta i_t] = -49`$ bp against the realized $`-50`$ bp emergency cut. A pure averaging of the five channels would have anchored the prior near zero; the synthesizer’s editorial layer is what allows $`\mathcal{P}_5`$ to follow the regime break.
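The override logic can be sketched as follows; the threshold, field names, and scores are hypothetical, intended only to show how a single high-severity channel can dominate a weighted consensus:

```python
# Hypothetical sketch of the synthesizer's channel weighting. A systemic
# signal above a severity threshold overrides the severity-weighted average
# of the five channels; the 0.9 threshold is an assumption, not the paper's.

def synthesize(channels):
    """channels: list of dicts with 'bias' in [-1, 1], 'severity' in [0, 1]."""
    dominant = max(channels, key=lambda c: c["severity"])
    if dominant["severity"] >= 0.9:   # systemic signal: let it dominate
        return dominant["bias"]
    total = sum(c["severity"] for c in channels) or 1.0
    return sum(c["bias"] * c["severity"] for c in channels) / total

# March-2020-style configuration: hawkish backward-looking data, but a
# systemic financial-conditions shock dominates the synthesis.
channels = [
    {"name": "data_releases",        "bias": +0.6, "severity": 0.40},
    {"name": "fed_communications",   "bias":  0.0, "severity": 0.20},
    {"name": "macro_outlook",        "bias": -0.3, "severity": 0.50},
    {"name": "financial_conditions", "bias": -0.9, "severity": 0.95},
    {"name": "market_expectations",  "bias": -0.7, "severity": 0.60},
]
net_bias = synthesize(channels)  # dominant systemic channel -> -0.9
```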

<figure id="fig:p5-architecture" data-latex-placement="t!">

<figcaption><em>Note:</em> Architecture of the news-stage update.
News articles are processed by five domain extractors in parallel (blue nodes).
A synthesizer consolidates the five reports into a single coherent assessment.
The Forecaster combines this synthesis with the document-only prior <span class="math inline">𝒫<sub>4</sub></span>
(entering from below) and produces <span class="math inline">𝒫<sub>5</sub></span> using the same narrative-first
protocol as stages <span class="math inline">𝒫<sub>1</sub></span>–<span class="math inline">𝒫<sub>4</sub></span>.</figcaption>
</figure>

Figure <a href="#fig:episode_distributions" data-reference-type="ref" data-reference="fig:episode_distributions">7</a> traces the $`\mathcal{P}_1 \to \mathcal{P}_5`$ evolution for six episodes selected to span the range of news-stage behavior: large corrections where $`\mathcal{P}_5`$ closes most of the gap to a realized policy break, sharpening of an already near-correct $`\mathcal{P}_4`$ onto a near-certain outcome, partial updates that leave residual uncertainty, and one case (December 2008) in which the realized decision falls outside the prior’s support entirely.

<figure id="fig:episode_distributions" data-latex-placement="t!">
<embed src="figures/distributional/v30.2/deepseek-v3.1:671b-cloud/news/episode_ridge_plots_light.pdf" />
<figcaption><em>Note:</em> Probability distributions over the rate decision at each filtration stage for six key FOMC meetings. Overlapping bars show <span class="math inline">𝒫<sub>1</sub></span> (statement), <span class="math inline">𝒫<sub>3</sub></span> (minutes), <span class="math inline">𝒫<sub>4</sub></span> (Beige Book), and <span class="math inline">𝒫<sub>5</sub></span> (inter-meeting news), with increasing opacity reflecting later stages. Dashed vertical lines mark the realized decision. Episodes span the range of news-stage contributions: large corrections (Mar 2020, Jan 2008), sharpening of near-certain outcomes (Dec 2015, May 2022), partial updates where residual uncertainty remains (Sep 2024), and a case where the decision lay beyond the prior support entirely (Dec 2008 ZLB transition).</figcaption>
</figure>

Across post-2004 meetings, the news stage cuts mean absolute forecast error roughly in half, with the largest gains concentrated in the heaviest revisions and no movement at all when the synthesizer judges the inter-meeting signal too weak to act on. The news stage therefore concentrates its work where it is needed and stays out of the way otherwise. Appendix <a href="#subsec:p4_vs_p5" data-reference-type="ref" data-reference="subsec:p4_vs_p5">[subsec:p4_vs_p5]</a> reports the full accuracy-by-update-magnitude breakdown alongside calibration, serial-correlation, and predictability diagnostics that compare $`\mathcal{P}_4`$- and $`\mathcal{P}_5`$-based surprises as identified shocks.

## Surprise Extractor

The Surprise Extractor quantifies the monetary policy surprise on FOMC announcement day. The agent receives the Forecaster’s final prior $`\mathcal{P}_4`$ and expected rate change, extracts the realized policy decision from the FOMC Statement text for meeting $`t`$, and computes the deviation:
``` math
\begin{equation*}
    s_t = \Delta i_t^{\text{realized}} - \mathbb{E}[\Delta i_t \mid \mathcal{P}_4]
\end{equation*}
```
where $`\Delta i_t^{\text{realized}}`$ is the announced rate change and $`\mathbb{E}[\Delta i_t \mid \mathcal{P}_4] = \sum_i d_i\, p_i`$ is the Forecaster’s probability-weighted expectation, with $`d_i`$ indexing the support of rate-decision outcomes (e.g., $`\pm 25`$ bp, $`\pm 50`$ bp, no change) and $`p_i`$ the corresponding $`\mathcal{P}_4`$ posterior probability.
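The extraction step is deterministic once the pmf and the realized decision are in hand; a minimal sketch, using the July 2022 figures from the trace box ($`\mathcal{P}_4 = \{+50: 55\%, +75: 45\%\}`$, realized $`+75`$ bp):

```python
# Sketch of the Surprise Extractor's arithmetic: the probability-weighted
# expectation under P4 and the residual surprise s_t. The support values
# (in basis points) and probabilities are taken from the paper's July 2022
# example; function names are illustrative.

def expected_change(pmf):
    """E[di | P] = sum_i d_i * p_i over the discretized support."""
    return sum(d * p for d, p in pmf.items())

def surprise(realized_bp, pmf):
    """s_t = realized rate change minus the probability-weighted expectation."""
    return realized_bp - expected_change(pmf)

p4 = {50: 0.55, 75: 0.45}
s_t = surprise(75, p4)  # E = 61.25 bp, so s_t = +13.75 bp
```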

Beyond $`s_t`$, the agent produces two ancillary outputs. A narrative summary of the decision is passed forward to the next meeting’s Decoders and Forecaster as cross-meeting context, helping the pipeline track evolving policy dynamics. A salience score $`\mu_t \in [0,1]`$ rates the announcement’s novelty, but is largely mechanical: the prior’s probability mass at the realised decision, $`p(\text{actual} \mid \mathcal{P}_4)`$, alone explains 77% of its variation (Appendix <a href="#sec:appendix_salience" data-reference-type="ref" data-reference="sec:appendix_salience">10.1.4</a>). Identification therefore relies on $`s_t`$ throughout, in keeping with market-based and regression-based alternatives.

<figure id="fig:surprise-contextual-salience" data-latex-placement="t!">
<embed src="figures/surprise_snipper/properties/v30.5/deepseek-v3.1:671b-cloud/contextual_salience_light.pdf" />
<figcaption><em>Note:</em> Monetary policy surprises <span class="math inline"><em>s</em><sub><em>t</em></sub> = <em>Δ</em><em>i</em><sub><em>t</em></sub><sup>realized</sup> − 𝔼[<em>Δ</em><em>i</em><sub><em>t</em></sub> ∣ 𝒫<sub>4</sub>]</span> over the sample period. Positive values indicate hawkish surprises (tighter than expected), negative values indicate dovish surprises (looser than expected). The largest surprises cluster during regime transitions (2008, 2015 liftoff, 2020 pandemic, 2022 tightening initiation).</figcaption>
</figure>

## Validation

The pipeline is validated along three pillars: stability under repeated execution, robustness to look-ahead bias, and coherence with external benchmarks. Each is summarised here; full diagnostics are in Appendix <a href="#appendix:pipeline-validation" data-reference-type="ref" data-reference="appendix:pipeline-validation">9</a>.

*Stability.* Six independent executions of the pipeline at temperature zero produce essentially the same series. Median cross-run dispersion is roughly 2 bp for both expected rate changes and surprises, an order of magnitude smaller than the 25 bp typical policy increment. Elevated dispersion is concentrated at genuine regime breaks (the 2001 recession, the 2008–09 ZLB transition with a peak in January 2009, the March 2020 pandemic, and the 2021 inflation-regime pivot), where Fed communication itself was more ambiguous (Appendix <a href="#appendix:multi-run-stability" data-reference-type="ref" data-reference="appendix:multi-run-stability">[appendix:multi-run-stability]</a>).

*Look-ahead bias.* If the model were retrieving memorised outcomes rather than reasoning from documents, meetings beyond its training cutoff would show *higher* cross-run dispersion. Across five model families with cutoffs ranging from July 2024 to January 2025 (DeepSeek-v3.1, Gemma4, Qwen3.6, GPT-4.1-mini, GPT-5-mini), in-sample and out-of-sample distributions overlap and none of the five $`t`$-tests is significant. A second check follows from the filtration design: stage-by-stage updates $`\mathcal{P}_1 \to \mathcal{P}_4`$ are large and document-driven, a behaviour memorisation alone cannot produce, since the realised outcome never enters the filtration prompts and appears only at the Surprise Extractor. Architectural controls (document-level temporal cutoffs, prompt-level date anchors, automated future-reference checks) reinforce the empirical null (Appendix <a href="#appendix:look-ahead-bias-prevention" data-reference-type="ref" data-reference="appendix:look-ahead-bias-prevention">9.1.1</a>).

*External benchmark.* The framework agrees with professional forecasters on both the central forecast and the level of policy uncertainty. The LLM’s rate expectation correlates with the market-implied expectation at 0.78 across the full sample, rising to 0.84 in the post-2019 every-meeting press-conference regime. Cross-forecaster disagreement on the three-month rate, a survey-based uncertainty proxy constructed independently of the LLM, co-moves with the magnitude of the narrative surprise at a correlation of 0.43, significant at the 0.1% level: meetings that forecasters found hard to call also produce larger LLM surprises (Appendix <a href="#sec:appendix_ce_validation" data-reference-type="ref" data-reference="sec:appendix_ce_validation">9.2</a>).

The critical question is whether these measures improve upon existing identification strategies. I now examine the framework’s empirical performance: first validating individual agent outputs, then assessing the resulting surprise measures through measurement quality diagnostics, impulse response analysis, and out-of-sample trading performance.

# Signal Extraction Framework

An *as-if* Bayesian signal extraction framework organises the empirical analysis. It defines the object produced by the pipeline, maps each component to a specific empirical test, and characterizes the conditions under which extraction quality degrades. This section presents the key ideas; Appendix <a href="#sec:appendix_information_model" data-reference-type="ref" data-reference="sec:appendix_information_model">12</a> provides the full model and derivations.

## Information Sets and the Measured Surprise

The framework reads the LLM’s output through an *as-if* Bayesian lens: I write down what a Bayesian aggregator would produce on the same extracted signals and use it as the interpretive benchmark. The calibration test in Section <a href="#subsec:testable_implications" data-reference-type="ref" data-reference="subsec:testable_implications">3.2</a> then asks whether the pipeline’s outputs are *observationally equivalent* to that benchmark, a behavioural property of the outputs rather than a claim about the model’s internal computation. Let $`t`$ index FOMC meetings and define three nested information sets just before the policy decision:
``` math
\begin{align}
    \mathcal{B}_t &\equiv \sigma(\text{public Fed documents processed by the pipeline before } t), \\
    \mathcal{M}_t &\equiv \sigma(\text{broader public information available by } t^-), \\
    \mathcal{G}_t &\equiv \sigma(\text{policymakers' internal assessments and deliberations by } t^-),
\end{align}
```
with $`\mathcal{B}_t \subseteq \mathcal{M}_t \subseteq \mathcal{G}_t`$. Let $`\theta_t \equiv \mathbb{E}[\Delta i_t \mid \mathcal{G}_t]`$ denote the expected rate decision conditional on the Fed’s internal information. The realized rate change adds an unpredictable innovation,
``` math
\begin{equation}
\label{eq:rate_decomposition}
    \Delta i_t = \theta_t + u_t, \qquad u_t \perp \mathcal{G}_t.
\end{equation}
```

Each document $`j`$ provides a noisy signal $`d_{jt} = \theta_t + \varepsilon_{jt}`$ with precision $`\tau_j(R_t)`$ that increases with the transparency regime $`R_t`$. The pipeline does not pass raw documents to the Forecaster. Specialized Decoder agents first extract structured summaries, and the Forecaster updates beliefs from these summaries alone. This two-stage architecture acts as an information bottleneck against contamination: the LLM’s training data may encode prior knowledge $`c_{jt}`$ about meeting $`t`$ (memorised outcomes, post-hoc narratives) that bypasses the documentary record. The Decoder’s structured-output protocol attenuates this channel, but a fraction $`\ell_j \in [0,1]`$ may still propagate through:
``` math
\begin{equation}
\label{eq:bottleneck}
    S_{jt} = \theta_t + \varepsilon_{jt} + \ell_j \, c_{jt} + \varepsilon_{jt}^{dec}.
\end{equation}
```
The Forecaster adds its own noise $`\nu_{jt}^{for}`$, giving an effective precision for each document stage:
``` math
\begin{equation}
\label{eq:effective_precision}
    \tilde{\tau}_j = \left(\tau_j^{-1} + (\kappa_j^{dec})^{-1} + (\kappa_j^{for})^{-1}\right)^{-1}.
\end{equation}
```
Extraction quality depends on document informativeness ($`\tau_j`$), Decoder capability ($`\kappa_j^{dec}`$), and Forecaster capability ($`\kappa_j^{for}`$). The leakage term $`\ell_j c_{jt}`$ is treated as a non-classical channel: it is not a Gaussian noise component and would inflate variance only if $`c_{jt}`$ varies meaningfully across meetings, but it can introduce mean-bias when $`c_{jt}`$ correlates with $`\theta_t`$. After processing up to four documents, the LLM posterior $`\mathcal{P}_4`$ yields the measured surprise:
``` math
\begin{equation}
    \hat{s}_t = \Delta i_t - m_{4t},
\end{equation}
```
where $`m_{4t}`$ is the posterior mean. Adding and subtracting the true conditional means given $`\mathcal{M}_t`$ and $`\mathcal{B}_t`$ decomposes the surprise into four terms:
``` math
\begin{equation}
\label{eq:model_main_decomp}
\begin{aligned}
    \hat{s}_t
    ={}& \underbrace{u_t}_{\text{policy innovation}}
    + \underbrace{\left(\mathbb{E}[\theta_t \mid \mathcal{G}_t] - \mathbb{E}[\theta_t \mid \mathcal{M}_t]\right)}_{\xi_t^{priv}:\ \text{private-information wedge}} \\
    & + \underbrace{\left(\mathbb{E}[\theta_t \mid \mathcal{M}_t] - \mathbb{E}[\theta_t \mid \mathcal{B}_t]\right)}_{\xi_t^{pub}:\ \text{public non-document wedge}}
    + \underbrace{\left(\mathbb{E}[\theta_t \mid \mathcal{B}_t] - m_{4t}\right)}_{\eta_t:\ \text{extraction error}}.
\end{aligned}
\end{equation}
```
Each term maps to a specific empirical margin. The conditioning set $`\mathcal{B}_t`$ is explicit and dated, so the wedges are interpretable rather than hidden inside announcement-window prices or *ex post* cleaning regressions. The extraction-error term $`\eta_t`$ is a catch-all for everything the LLM gets wrong about its own conditioning set, including classical Bayesian-update noise and any leakage bias $`\sum_j w_{jt}\ell_j c_{jt}`$; Appendix <a href="#sec:appendix_information_model" data-reference-type="ref" data-reference="sec:appendix_information_model">12</a> unpacks $`\eta_t`$ into these two components explicitly.
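To fix ideas, the aggregation and decomposition above can be simulated in a few lines. The sketch below is illustrative only, not the paper's pipeline: all precisions are hypothetical numbers, leakage $`\ell_j`$ is set to zero, and the three information sets are collapsed so the two wedges vanish. It forms the precision-weighted posterior mean from four noisy document signals and checks that the measured surprise splits into policy innovation plus extraction error, with a forecast-efficiency slope near unity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # simulated FOMC meetings

# Latent expected decision theta_t (prior mean 0) and policy innovation u_t
sigma_theta = 0.25
theta = rng.normal(0.0, sigma_theta, n)
u = rng.normal(0.0, 0.05, n)
di = theta + u  # realized rate change, Delta i_t = theta_t + u_t

# Effective precision per document stage:
# 1/tilde_tau_j = 1/tau_j + 1/kappa_dec_j + 1/kappa_for_j
tau = np.array([8.0, 6.0, 4.0, 4.0])  # document precisions (hypothetical values)
kappa_dec = np.full(4, 50.0)          # Decoder extraction precisions
kappa_for = np.full(4, 50.0)          # Forecaster precisions
eff_var = 1.0 / tau + 1.0 / kappa_dec + 1.0 / kappa_for
eff_tau = 1.0 / eff_var

# Extracted signals S_jt = theta_t + noise (leakage ell_j set to zero)
S = theta[:, None] + rng.normal(0.0, np.sqrt(eff_var), (n, 4))

# Bayesian posterior mean m_4t: precision-weighted signals shrunk toward the prior
tau0 = 1.0 / sigma_theta**2
m4 = (S * eff_tau).sum(axis=1) / (tau0 + eff_tau.sum())

# Measured surprise; with the information sets collapsed, the wedges are zero
# and the surprise is exactly policy innovation + extraction error
s_hat = di - m4
eta = theta - m4
assert np.allclose(s_hat, u + eta)

# Forecast-efficiency slope: near 1 because a Bayesian forecast error is
# uncorrelated with the posterior mean in population
C = np.cov(di, s_hat)
beta = C[0, 1] / C[1, 1]
print(f"slope on surprise: {beta:.3f}")
```

Raising the extraction precisions $`\kappa_j^{dec}`$, $`\kappa_j^{for}`$ in the sketch shrinks the variance of `eta`, the analogue of the transparency-and-capability margin discussed below.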

## Testable Implications

The framework generates two implications that the empirical sections test directly, plus a sequential-learning property used to organise the appendix.[^11]

The first implication concerns forecast efficiency. If the pipeline aggregates signals as a Bayesian would on the extracted summaries, with negligible leakage ($`\ell_j \approx 0`$), two complementary regressions of the realised rate change should each yield a unit slope: regressing $`\Delta i_t`$ on the prior mean $`\mathbb{E}[\Delta i_t \mid \mathcal{P}_4]`$ tests whether the prior tracks the truth on average, and regressing $`\Delta i_t`$ on the surprise $`\hat{s}_t`$ tests whether the surprise carries the rate-change information one-for-one. Table <a href="#tab:expectation_moments" data-reference-type="ref" data-reference="tab:expectation_moments">[tab:expectation_moments]</a> reports a slope of $`1.000`$ in the expectation regression and Table <a href="#tab:surprise_measurement" data-reference-type="ref" data-reference="tab:surprise_measurement">[tab:surprise_measurement]</a> reports $`1.005`$ in the surprise regression, both statistically indistinguishable from unity. The estimates are consistent with the joint hypothesis of small leakage and observational equivalence to a Bayesian, though they do not separately identify either condition.

The second implication concerns transparency and degradation. Surprise variance decreases when transparency raises document precision $`\tau_j`$ or when better LLMs raise extraction precision $`\kappa_j^{dec}`$, $`\kappa_j^{for}`$. The measure degrades when documents are unavailable, uninformative, poorly extracted, or when contamination bypasses the bottleneck. The Consensus Economics validation (Appendix <a href="#appendix:ce-llm-validation" data-reference-type="ref" data-reference="appendix:ce-llm-validation">9.2.3</a>) confirms the transparency margin: the correlation between the LLM prior and the market-implied expectation rises from $`r = 0.38`$ in the 2011–2018 partial-transparency regime (mechanically compressed by the zero lower bound) to $`r = 0.84`$ in the post-2019 every-meeting press-conference regime. The contamination margin is bounded by the look-ahead diagnostics summarised in Section <a href="#sec:validation" data-reference-type="ref" data-reference="sec:validation">2.9</a>, which find no evidence of higher cross-run dispersion outside the training window.

The remaining margins of the decomposition map onto specific results developed later: predictability of $`\hat{s}_t`$ from non-document public variables ($`R^2 = 0.166`$, falling to 2.9% with the $`\mathcal{P}_5`$ news extension) bears on the public non-document wedge $`\xi_t^{pub}`$; the 11.2-percentage-point Greenbook increment constrains the precision of the Fed’s private signal $`\xi_t^{priv}`$. By construction $`\hat{s}_t`$ is a documentary-conditioned monetary policy shock (the unanticipated component of the rate decision given $`\mathcal{B}_t`$); the LP and IV estimands recover the response to that shock as it is delivered in the meeting-day announcement bundle, with the same scope as Gertler & Karadi (2015) and broader than the ex-post-cleaned object targeted by Bauer & Swanson (2023a) and Miranda-Agrippino & Ricco (2021). Further detail appears alongside the impulse responses in Section <a href="#sec:transmission" data-reference-type="ref" data-reference="sec:transmission">6</a>.

# Data

The primary inputs to the pipeline are FOMC Statements, press conference transcripts, Minutes, and Beige Books, all sourced from the Federal Reserve’s public website. The sample spans 272 FOMC meetings from January 1996 through March 2026. Statement and Minutes coverage begins in 2000; earlier meetings (1996–1999) have partial document availability. Press conference transcripts are available from April 2011 and were held quarterly until January 2019, after which every meeting includes a press conference. Beige Books are available for approximately 239 of the 272 meetings; the remaining meetings (inter-meeting actions and emergency decisions) lack a dedicated Beige Book release. The pipeline processes whatever documents are available for each meeting, skipping filtration stages when a document is missing.

The optional $`\mathcal{P}_5`$ stage (Section <a href="#sec:p5" data-reference-type="ref" data-reference="sec:p5">2.7</a>) draws on news text from FactSet StreetAccount, a curated financial-news service whose editorial team distils real-time wire output into concise, market-focused summaries of macroeconomic data releases, Fed communications, financial-conditions news, and dealer commentary. Coverage begins in mid-2003 and reaches sufficient density from 2004 onward, which sets the effective sample for $`\mathcal{P}_5`$ (178 meetings); each pre-FOMC blackout window contains a median of roughly 12–18 articles per day after macro-topic filtering, of which 3–5 per day are policy-relevant.

For validation and comparison, the analysis incorporates several external data sources. The updated Romer & Romer (2004) series of monetary policy shocks is obtained from Acosta (2023).[^12] Market-based surprise measures are obtained from Jarociński & Karadi (2020).[^13] Fed Funds futures (FF1–FF4) are sourced from LSEG DataScope Tick History from 1996 onwards; pre-1996 data come from Gürkaynak et al. (2005). Eurodollar futures (ED1–ED4) are sourced from TickData (until 2019) and LSEG DataScope Tick History (2019–2022); from January 2023 onwards, SOFR futures from LSEG DataScope Tick History are used. High-frequency data are aggregated to one-minute frequency. Surprises are computed over a 30-minute window as the difference between the post-announcement value (median over \[t+15min, t+25min), where t is the announcement time) and the pre-announcement value (median over (t-15min, t-5min\]). When these windows contain fewer than three observations, they are extended up to 24 hours to ensure robustness; missing values are recorded if insufficient observations remain.
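As a concrete illustration of the windowing rule above, the following sketch computes a surprise from synthetic one-minute quotes. The function name and data layout are my own, and the 24-hour fallback extension is omitted for brevity:

```python
import pandas as pd

def announcement_surprise(px: pd.Series, t: pd.Timestamp, min_obs: int = 3):
    """Surprise = post-window median minus pre-window median.

    px: one-minute price/rate series indexed by timestamp.
    Pre-window: (t-15min, t-5min]; post-window: [t+15min, t+25min).
    Returns None when either window has fewer than `min_obs` observations
    (the paper instead extends the window up to 24 hours in that case).
    """
    pre = px[(px.index > t - pd.Timedelta("15min"))
             & (px.index <= t - pd.Timedelta("5min"))]
    post = px[(px.index >= t + pd.Timedelta("15min"))
              & (px.index < t + pd.Timedelta("25min"))]
    if len(pre) < min_obs or len(post) < min_obs:
        return None
    return post.median() - pre.median()

# Synthetic example: a 25bp downside move in the implied rate at announcement
idx = pd.date_range("2020-03-03 13:00", periods=121, freq="min")
t = pd.Timestamp("2020-03-03 14:00")
px = pd.Series([5.00 if ts < t else 4.75 for ts in idx], index=idx)
print(announcement_surprise(px, t))  # -0.25
```

Using medians rather than single pre/post quotes makes the measure robust to isolated stale or erroneous ticks inside each window.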

Two ex-post cleaned shock series enter as comparators: Bauer & Swanson (2023b), which residualises market-based surprises on contemporaneous macro releases,[^14] and Miranda-Agrippino & Ricco (2021), which residualises on internal Greenbook forecasts.[^15] Both are matched to FOMC meeting dates. Remaining macro and financial data come from FRED, the Federal Reserve Board, and standard databases.

# Results

## Forecaster Synthesis Performance

The Forecaster synthesizes the previous meeting’s statement and (from April 2011 onward) the Chair’s press conference, the previous meeting’s minutes, and the current Beige Book to form complete probability distributions over rate decisions. Table <a href="#tab:expectation_moments" data-reference-type="ref" data-reference="tab:expectation_moments">[tab:expectation_moments]</a> reports a sequence of nested OLS regressions of the realized rate change on progressively richer subsets of the moments of those distributions, building up to the full posterior mean as the sole regressor in column (5).

Adding the Beige Book aggregate to the rate level raises $`R^2`$ from 0.007 to 0.203 (columns 1–2), confirming that regional economic narratives carry substantial rate-decision content. Once the Forecaster’s first moment enters (column 5), it subsumes the Beige Book aggregate, raises $`R^2`$ to 0.542, and yields a slope coefficient of $`1.000`$, statistically indistinguishable from unity (Proposition <a href="#prop:efficiency" data-reference-type="ref" data-reference="prop:efficiency">1</a>).[^16] The standard deviation carries significant standalone predictive power (column 3) but loses all significance once the mean enters (columns 7–8): wider distributions correlate with larger rate moves during easing episodes, but this is a mean-variance correlation that $`\mathbb{E}[\Delta i_t \mid \mathcal{B}_t]`$ absorbs, not independent information.

Column (6) tests whether the slope coefficient depends on Beige Book availability. The interaction $`\mathbb{E}[\Delta i_t \mid \mathcal{B}_t] \times d_{\text{BB}}`$ is insignificant in the pooled sample, but this masks a regime-dependent contribution. On the common sample of $`n = 189`$ meetings, the Beige Book raises the probability mass on the realized outcome by $`+12.1`$pp for cuts (improving in 68% of cut meetings) and $`+1.95`$pp for hikes (improving in 56%), while slightly reducing it for holds ($`-1.78`$pp). The cut effect concentrates in the major recessionary easing episodes (2008, 2020), when regional conditions deteriorated faster than top-down Committee communications could acknowledge; this is consistent with the Beige Book’s distinctive role as the only *bottom-up, real-economy* input with timing fresh to meeting $`t`$, while the statement, press conference, and minutes are *top-down Committee* documents released around meeting $`t-1`$. Because holds dominate the sample numerically, the average effect washes out even though the Beige Book’s contribution is concentrated in recessionary cuts, with a smaller positive contribution to hikes. Per-stage entropy and KL waterfall decompositions appear in Appendix <a href="#subsec:distributional_appendix" data-reference-type="ref" data-reference="subsec:distributional_appendix">10.1.2</a>.

Table <a href="#tab:benchmark_horse_race" data-reference-type="ref" data-reference="tab:benchmark_horse_race">[tab:benchmark_horse_race]</a> evaluates the Forecaster’s predictions against a horse race of simple benchmarks and an encompassing regression. Column (1) establishes the LLM’s standalone performance: $`\text{E}[\Delta i_t \mid \mathcal{B}_t]`$ explains 53.9% of rate-change variance using only qualitative Fed documents.

As benchmarks, I consider a simple momentum rule (the previous meeting’s rate change) and the slope of the yield curve. The momentum rule captures 37.2% of the variance (column 2), while yield-curve slopes are substantially more informative at 64.5% (column 3), consistent with the term structure encoding the market’s comprehensive expectations of the future policy path.

Adding the LLM forecast to yield-curve slopes raises $`R^2`$ from 67.4% to 69.4%, with the LLM forecast entering at $`0.381`$, significant at the 1% level (column 5). The text-based expectation also absorbs historical momentum: the coefficient on the lagged rate change $`\Delta i_{t-1}`$ collapses from $`0.218`$ (significant at the 1% level) in column (4) to a statistically insignificant $`0.063`$, indicating that the LLM subsumes the persistence channel. While financial variables offer superior predictive power, yield curves conflate policy expectations with unobservable risk premia (Adrian et al., 2013; Cochrane, 2011) and non-policy flows. I rely on the LLM forecast because it conditions only on a known sequence of time-stamped public documents; that observable structure is what makes the resulting residual a document-conditioned innovation in the sense of Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a> (derivations in Appendix <a href="#sec:appendix_information_model" data-reference-type="ref" data-reference="sec:appendix_information_model">12</a>).

## Properties of the Surprise

This subsection asks whether the narrative surprise behaves as a disciplined forecast error relative to its conditioning set. Three properties are tested: predictability from public non-document predictors, forecast efficiency (slope unity on the realized rate change), and incremental information beyond the other available measures.

### Predictability from Public Predictors

What a predictability test says about the surprise depends on its conditioning set. A surprise conditioned on the full public information set should be unpredictable from any pre-meeting public variable. Because the narrative surprise conditions only on documents, predictability from non-document public predictors should measure the share of public information that lies outside the documentary record.

Table <a href="#tab:predictability_regressions" data-reference-type="ref" data-reference="tab:predictability_regressions">[tab:predictability_regressions]</a> regresses narrative and market-based surprises on the six Bauer & Swanson (2023b) predictors: NFP surprise, 12-month employment growth, 3-month S&P 500 returns, term spread, commodity prices, and Treasury market skewness. The LLM ($`\mathcal{P}_4`$) column covers 223 FOMC meetings (1996–2024); the LLM+News ($`\mathcal{P}_5`$) column covers the same 223 meetings, with $`\mathcal{P}_5 = \mathcal{P}_4`$ for the (relatively few) meetings that lack inter-meeting StreetAccount coverage so the news stage has no content to ingest; the R&R column covers 201, reflecting Greenbook coverage limits.

The LLM surprise is moderately predictable, sitting within the range of market-based measures and slightly below R&R. The R&R comparison is the informative one: R&R conditions on the Greenbook, the Fed staff’s private macro forecast, which should already absorb much of the public macro signal in the B&S predictors. That the LLM achieves comparable predictability without conditioning on private Fed forecasts is consistent with the LLM extracting the policy-relevant content of public documents efficiently. The patterns across individual predictors are themselves informative: 3-month S&P 500 returns predict the LLM surprise but not R&R, indicating that the LLM does not fully condition on recent equity-market dynamics; NFP surprises and payroll growth are insignificant for both narrative measures and load instead on market-based futures, consistent with NFP releases falling inside the FOMC blackout window.

Within the signal-extraction framework of Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a>, this predictability reflects the public non-document wedge $`\xi_t^{pub}`$: the predictors lie in $`\mathcal{M}_t \setminus \mathcal{B}_t`$. Two diagnostics make extraction error $`\eta_t`$ unlikely as the main driver: the slope coefficient on the realised rate change is indistinguishable from unity (Proposition <a href="#prop:efficiency" data-reference-type="ref" data-reference="prop:efficiency">1</a>), and residualising the surprise on the six B&S predictors leaves the impulse responses essentially unchanged (Appendix <a href="#subsec:purged_surprise_robustness" data-reference-type="ref" data-reference="subsec:purged_surprise_robustness">11.1.4</a>). Incorporating inter-meeting StreetAccount articles via $`\mathcal{P}_5`$ closes the documentary window, and the predictors’ joint significance disappears (the joint test no longer rejects at the 0.1% level); the news stage reverses $`\mathcal{P}_4`$’s direction in only 4.5% of meetings, so the dominant mode is magnitude correction rather than contradiction (Appendix <a href="#subsec:p4_vs_p5" data-reference-type="ref" data-reference="subsec:p4_vs_p5">[subsec:p4_vs_p5]</a>).
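Generically, the predictability test has the following shape: regress the surprise on the pre-meeting predictors and examine the $`R^2`$ and the joint F-statistic. The sketch below uses synthetic data; the predictor matrix is a stand-in for the six B&S variables and all coefficient values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 223, 6  # meetings, predictors (matching the P4 sample size)

X = rng.normal(size=(n, p))                      # stand-ins for the six predictors
s_hat = 0.2 * X[:, 0] + rng.normal(0.0, 0.5, n)  # surprise with one leaking predictor

# OLS with intercept
Z = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Z, s_hat, rcond=None)[0]
resid = s_hat - Z @ beta
r2 = 1.0 - resid.var() / s_hat.var()

# Joint F-test that all p slope coefficients are zero
F = (r2 / p) / ((1.0 - r2) / (n - p - 1))
print(f"R^2 = {r2:.3f}, F = {F:.2f}")
```

A surprise conditioned on the full public information set would drive both `r2` and `F` toward their null values; a document-conditioned surprise leaves whatever lies outside the documentary record in the fitted part.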

### Forecast Efficiency

To assess forecast efficiency, I estimate
``` math
\begin{equation}
\Delta i_t = \alpha + \beta \hat{s}_t + e_t.
\end{equation}
```

As derived in Appendix <a href="#sec:appendix_information_model" data-reference-type="ref" data-reference="sec:appendix_information_model">12</a>, the population slope can be written as
``` math
\begin{equation}
    \beta = 1 + \frac{\mathrm{Cov}(m_{4t}, \eta_t)}{\mathrm{Var}(\hat{s}_t)},
\end{equation}
```
where $`m_{4t}`$ is the LLM’s posterior mean at stage $`\mathcal{P}_4`$ (Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a>). Under the signal extraction framework, $`\beta = 1`$ holds when the LLM aggregates its extracted signals as a Bayesian would and leakage is negligible (Proposition <a href="#prop:efficiency" data-reference-type="ref" data-reference="prop:efficiency">1</a>). The finding $`\beta \approx 1`$ is consistent with approximately Bayesian updating and small contamination leakage, though it does not separately identify either channel.[^17]
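The algebra behind this expression is short (a sketch of the appendix derivation; it uses only the identity $`\Delta i_t = m_{4t} + \hat{s}_t`$ and the framework’s orthogonality assumptions):
``` math
\begin{equation}
    \beta = \frac{\mathrm{Cov}(\Delta i_t, \hat{s}_t)}{\mathrm{Var}(\hat{s}_t)}
          = \frac{\mathrm{Cov}(m_{4t} + \hat{s}_t, \hat{s}_t)}{\mathrm{Var}(\hat{s}_t)}
          = 1 + \frac{\mathrm{Cov}(m_{4t}, \hat{s}_t)}{\mathrm{Var}(\hat{s}_t)},
\end{equation}
```
and $`\mathrm{Cov}(m_{4t}, \hat{s}_t)`$ collapses to $`\mathrm{Cov}(m_{4t}, \eta_t)`$ when the policy innovation and the two wedges in the decomposition are uncorrelated with the posterior mean.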

Table <a href="#tab:surprise_measurement" data-reference-type="ref" data-reference="tab:surprise_measurement">[tab:surprise_measurement]</a> reports the result. The narrative surprise yields a coefficient ($`\beta = 1.005`$, s.e. = 0.091) statistically indistinguishable from unity, consistent with approximately Bayesian updating and small leakage (Proposition <a href="#prop:efficiency" data-reference-type="ref" data-reference="prop:efficiency">1</a>).[^18] Market-based measures show $`\beta > 1`$, with FF4 and FF1 significantly exceeding unity, reflecting a different mechanism: futures-based surprises conflate unanticipated policy actions with information effects and, in the case of longer-horizon instruments, risk premia that push the coefficient above unity even absent measurement noise (Ricco & Savini, 2025).

### Incremental Information vs. Other Measures

On the conventional-policy sample, the narrative surprise and R&R are complementary. Table <a href="#tab:surprise_power" data-reference-type="ref" data-reference="tab:surprise_power">[tab:surprise_power]</a> shows that both remain highly significant when combined (column 3), with R&R adding 12.2pp of explanatory power beyond the full LLM sample baseline (column 3 vs column 1). Because R&R coverage ends in 2018 (Greenbook confidentiality lag) while the LLM panel extends to 2026, column 2 reports the sample-matched LLM-only baseline on the LLM $`\cap`$ R&R intersection ($`N = 178`$). The matched-sample LLM-only $`R^2 = 0.402`$ is below the full-sample $`R^2 = 0.463`$ because the post-2018 LLM observations include the 2022–2024 active-cycle meetings on which the LLM extracts strongly. The structural Greenbook private-information wedge, $`R^2(\text{col 3}) - R^2(\text{col 2}) = 18.3`$pp, is therefore larger than the cross-sample 12.2pp comparison suggests, with the difference attributable to sample composition rather than to the underlying $`\mathcal{G}_t \setminus \mathcal{M}_t`$ gap.

The LLM coefficient attenuates from 1.005 (full sample) to 0.935 (matched sample) to 0.645 when R&R enters, which is expected: both measures respond to the same underlying policy signals, so partial overlap is natural rather than a sign of misspecification. The 18.3pp structural increment is consistent with private Greenbook information in $`\mathcal{G}_t \setminus \mathcal{M}_t`$: R&R identification conditions on the Fed’s internal forecasts, which contain policy-relevant content beyond the broader public information set. Adding all four high-frequency market-based measures beyond this narrative-Greenbook baseline raises $`R^2`$ by only 4.8pp in total (column 7 vs column 3), consistent with most policy information flowing through official channels well before announcement-day pricing, as documented by Lucca & Moench (2015). A natural extension of this framework would apply the same LLM extraction methodology directly to Greenbook documents, constructing expectations from $`\mathcal{G}_t`$ rather than $`\mathcal{B}_t`$ and thereby measuring the private-information wedge narratively rather than as a residual econometric increment.

## What Kind of Shock Is It?

Two further questions determine what the narrative surprise captures: does it survive at the zero lower bound when the policy rate cannot move, and does it load on the pure monetary policy or the central bank information component of FOMC announcements?

### Identification Under Policy Rate Constraints

The Romer & Romer (2004) approach constructs a surprise as the residual from regressing the actual federal funds rate change on Greenbook forecasts: $`\hat{s}_t^{R\&R} = \Delta i_t - \hat{\mathbb{E}}^{GB}[\Delta i_t]`$. R&R is the natural narrative benchmark; Miranda-Agrippino & Ricco (2021) ends in 2012 and Bauer & Swanson (2023b) ends in 2019, so both are excluded from the ZLB analysis but enter the pre-crisis comparison (Appendix <a href="#subsec:precrisis_comparisons" data-reference-type="ref" data-reference="subsec:precrisis_comparisons">11.1.5</a>). When the policy rate moves freely, the R&R residual identifies an unexpected deviation in the observed instrument. When it is constrained at the zero lower bound, $`\Delta i_t = 0`$ for almost every meeting by construction: the left-hand side loses its variation and the residual reduces mechanically to $`-\hat{\mathbb{E}}^{GB}[\Delta i_t]`$, approximating a counterfactual desired-rate gap under a reaction function estimated in a different regime.

The LLM approach behaves differently. The Forecaster reads what the Fed actually communicated, including forward guidance, balance-sheet signals, and commitment language, and forms a distribution over rate decisions directly from the text. It does not require the policy rate to move to register a non-trivial surprise: persistent accommodation communicated through language generates systematically negative surprises when the Fed signals more dovishness than a mechanical hold would imply. The ZLB is thus the regime where the two measures’ properties diverge most sharply.

Table <a href="#tab:zlb_moments_comparison" data-reference-type="ref" data-reference="tab:zlb_moments_comparison">[tab:zlb_moments_comparison]</a> confirms the divergence. R&R produces a near-symmetric distribution centred on negative-Greenbook-expectation territory, with a majority of ZLB meetings classified as positive surprises by construction: the sign tracks whether the Fed was internally forecasting cuts that did not materialise, not any actual policy deviation. The LLM measure clusters at zero, reflecting the Forecaster correctly anticipating the persistent ZLB hold from forward-guidance and balance-sheet language; the residual mass is asymmetric, with more dovish than hawkish surprises (the communicated stance was more accommodative than the prior over rate decisions implied), and the right tail picks up rare regime-defining events such as the December 2015 liftoff. The sign convention is the rate-decision forecast error, $`s_t = \Delta i_t - \mathbb{E}[\Delta i_t \mid \mathcal{P}_4]`$: at the ZLB $`\Delta i_t = 0`$ for almost every meeting, so a negative $`s_t`$ records a meeting where the documents implied a positive expected change that did not materialise (and conversely for positive $`s_t`$), with forward-guidance language doing the work of moving $`\mathbb{E}[\Delta i_t \mid \mathcal{P}_4]`$. This asymmetry is by design: a measure that reads forward guidance directly should register zero whenever the language merely confirms continued accommodation. The relevant orthogonality condition is that surprises are unpredictable from prior information, not that they have zero mean in every regime subsample.

Table <a href="#tab:zlb_correlation_robustness" data-reference-type="ref" data-reference="tab:zlb_correlation_robustness">[tab:zlb_correlation_robustness]</a> shows the overlap directly. Across the full ZLB sample the two measures correlate at $`0.460`$, significant at the 1% level, which might suggest they are picking up much of the same policy variation. The story changes once the two regime-defining meetings are removed, the January 2009 QE1 announcement and the December 2015 liftoff: the correlation collapses to $`0.060`$, statistically indistinguishable from zero. The series therefore agree only at the meetings where the funds-rate target itself moves enough for R&R to register a surprise in its own right; elsewhere, when the rate is pinned and information flows through forward guidance and balance-sheet language, they share essentially no common variation.
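The mechanics of that collapse are worth making concrete. The toy example below uses synthetic numbers chosen for illustration, not the paper's series: two large co-moving observations dominate an otherwise near-zero correlation, and removing them reveals the absence of common variation in the bulk.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(0.0, 0.05, n)  # e.g., one surprise series at ZLB meetings
y = rng.normal(0.0, 0.05, n)  # e.g., another series, unrelated in the bulk

# Two regime-defining meetings where both measures move together
x[:2] = [0.50, -0.40]
y[:2] = [0.45, -0.35]

full = np.corrcoef(x, y)[0, 1]
trimmed = np.corrcoef(x[2:], y[2:])[0, 1]
print(f"full-sample r = {full:.2f}, excluding the two outliers r = {trimmed:.2f}")
```

With only 28 ZLB meetings in the actual sample, two influential observations can carry essentially all of a 0.46 correlation, which is why the trimmed comparison is the informative one.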

Bügel et al. (2026) address the ZLB problem within R&R by substituting the J. C. Wu & Xia (2016) shadow rate for the conventional funds rate, and their diagnostics confirm the resulting shocks pass the serial-correlation and predictability tests that the conventional R&R series fails at the ZLB. LLM extraction provides an alternative that does not require a shadow-rate model: forward guidance, balance-sheet operations, and commitment language are processed directly from the text rather than approximated through a scalar rate substitute.

### Channel Attribution: MP versus CBI Components

Jarociński & Karadi (2020) identify a pure monetary policy shock (MP: yields rise, stocks fall) and a central bank information shock (CBI: yields and stocks rise together) within a Bayesian VAR with sign restrictions. The decomposition is structural and external to any of the announcement-based measures considered here, which makes it a natural benchmark for consistency: a candidate MP shock should load one-for-one on the J&K MP component and have zero loading on the CBI component. Table <a href="#tab:unified_jk_validation" data-reference-type="ref" data-reference="tab:unified_jk_validation">[tab:unified_jk_validation]</a> regresses three surprise measures on these two components without a constant.

The two market-based measures fail both legs of the test. M-A&R and B&S each reject $`\beta_{\text{MP}} = 1`$ at the 1% level and reject $`\beta_{\text{CBI}} = 0`$ at the 1% level. Their MP coefficients are attenuated and their CBI coefficients are economically large, so the cleaning step has not delivered a series that lines up with the structural MP shock. This is unsurprising given construction: M-A&R and B&S are derived from the same announcement-window price changes that the J&K sign restrictions discipline, so any residual information-effect content in those prices passes through.

The LLM surprise is the only column where the joint null is not rejected, with standard errors three to four times wider than the announcement-window measures’. The non-rejection is therefore partly a low-power result rather than sharp identification of a pure MP shock, and the relative advantage strengthens in the post-2008 sample where forward guidance, balance-sheet language, and commitment text become the binding policy channels that announcement-window prices cannot mechanically span (Appendix <a href="#subsec:jk_subsample_robustness" data-reference-type="ref" data-reference="subsec:jk_subsample_robustness">11.1.7</a> reports the sub-sample table). Calibration against realized rate moves (Table <a href="#tab:surprise_measurement" data-reference-type="ref" data-reference="tab:surprise_measurement">[tab:surprise_measurement]</a>) and the IRF diagnostics in Section <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a> carry the absolute quality argument; this table is a relative consistency check whose informativeness depends on the policy regime.

The Bauer-Swanson robustness result in Appendix <a href="#subsec:purged_surprise_robustness" data-reference-type="ref" data-reference="subsec:purged_surprise_robustness">11.1.4</a> addresses a logically separate concern, predictability of the surprise from pre-meeting public observables, and shows that residualizing the surprise on the six B&S predictors leaves the impulse responses unchanged.

Together the two diagnostics support a partial-MP-identification reading. The narrative surprise registers monetary policy information through what the Fed communicates rather than through what the rate does, which is why it survives the ZLB where the rate-based R&R residual loses its left-hand-side variation. On the J&K decomposition, it is the only series that does not reject the structural pure-MP loading; the cleaned market-based alternatives reject both legs. I stop short of claiming sharp identification of a pure MP shock: the J&K standard errors are three to four times wider on the LLM than on the announcement-window measures, so the non-rejection is partly a low-power result and a meaningful CBI admixture cannot be ruled out from this test alone. The honest scope claim is that the LLM behaves more like a structural MP shock than the cleaned market-based alternatives do, with the relative advantage strongest in the post-2008 regime where forward guidance, balance-sheet language, and commitment text are the binding policy channels. The next subsection sharpens this characterisation along the predictive and decomposition dimensions, where the surprise emerges as a *target-anchored communication shock*: linearly aligned with current-meeting target news, predictive of the future policy path, and orthogonal to the linear announcement-window basis.

## What Does the LLM Surprise Capture?

I test two properties: the surprise’s predictive power for future rate changes, and its lack of positive serial correlation. A third test, the decomposition on the Gürkaynak et al. (2005) target and path factors, appears in Appendix <a href="#subsec:gss_decomposition" data-reference-type="ref" data-reference="subsec:gss_decomposition">11.1.6</a>. These tests suggest the LLM measure functions as a *target-anchored communication surprise*: it loads on the current rate decision, conveys persistent policy-stance information from pre-announcement Fed documents, and is not a noisy proxy for announcement-window derivatives.

I test whether the narrative surprise predicts future policy actions by estimating
``` math
\begin{equation}
\label{eq:forward_prediction}
    \Delta i_{t+k} = \alpha + \beta \, \hat{s}_t + \varepsilon_{t+k}, \qquad k = 1, \ldots, 4
\end{equation}
```
where $`\Delta i_{t+k}`$ is the rate change at the $`k`$-th subsequent meeting and $`\hat{s}_t`$ is the surprise at filtration stage $`j \in \{\mathcal{P}_1, \mathcal{P}_3, \mathcal{P}_4\}`$; $`\mathcal{P}_2`$ is excluded because press conferences began only in April 2011, which would roughly halve the available sample.[^19]

All three filtration stages predict the next meeting’s rate change and the cumulative six-meeting path (Tables <a href="#tab:forward_rate_changes" data-reference-type="ref" data-reference="tab:forward_rate_changes">[tab:forward_rate_changes]</a>–<a href="#tab:forward_cumulative_path" data-reference-type="ref" data-reference="tab:forward_cumulative_path">[tab:forward_cumulative_path]</a>), with $`\mathcal{P}_1`$ delivering the strongest single-meeting loading and a one-percentage-point cumulative six-meeting effect above unity; coefficients increase monotonically with horizon, consistent with a persistent policy-stance signal rather than a single meeting’s news. The $`\mathcal{P}_1 > \mathcal{P}_3 > \mathcal{P}_4`$ ordering at every horizon aligns with FOMC statements as the primary source of path surprises (Gürkaynak et al., 2005): subsequent stages absorb path information into the prior, leaving $`\mathcal{P}_4`$’s residual smaller on the trajectory but still positive. This predictive content for the policy path coexists with the surprise’s near-zero loading on the linear GSS path factor (Appendix <a href="#subsec:gss_decomposition" data-reference-type="ref" data-reference="subsec:gss_decomposition">11.1.6</a>): the path-relevant content the documents carry need not be in the linear span of announcement-window derivatives, and the forward-prediction regressions show that this content is informative about future policy even when the announcement-window basis does not price it.
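As a concrete sketch of the forward-prediction regression above, the snippet below runs the horizon-by-horizon OLS on simulated data. The data-generating process, seeds, and all names are hypothetical illustrations, not the paper's series or code; the point is only the mechanics of regressing the rate change $k$ meetings ahead on the current surprise.

```python
import numpy as np

def forward_prediction_beta(s, di, k):
    """OLS slope from regressing the rate change k meetings ahead on the
    current surprise: di[t+k] = alpha + beta * s[t] + e[t+k]."""
    x, y = s[:-k], di[k:]
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Hypothetical DGP: the surprise signals rate changes one and two
# meetings ahead, mimicking a persistent policy-stance signal.
rng = np.random.default_rng(0)
s = rng.normal(size=200)
di = np.zeros(200)
for t in range(2, 200):
    di[t] = 0.6 * s[t - 1] + 0.2 * s[t - 2] + 0.05 * rng.normal()
betas = [forward_prediction_beta(s, di, k) for k in (1, 2, 3, 4)]
```

Under this DGP the estimated loadings recover the built-in signal at $k=1,2$ and fall to zero at longer horizons, the qualitative pattern the tables test for (the paper's own coefficients come from the actual meeting-level data, with HAC inference).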

Forward prediction of rate changes is necessary but not sufficient for a forward guidance interpretation: a stale expectation would also predict future rate changes if the Fed’s policy path is persistent (Rudebusch, 2002). To rule this out, I test whether surprises predict their own future values:
``` math
\begin{equation}
\label{eq:forward_own}
    \hat{s}_{t+k} = \alpha + \beta \, \hat{s}_t + \varepsilon_{t+k}, \qquad k = 1, \ldots, 3
\end{equation}
```

Table <a href="#tab:forward_own_surprises" data-reference-type="ref" data-reference="tab:forward_own_surprises">[tab:forward_own_surprises]</a> reveals no significant serial correlation at any horizon or filtration stage: the surprise predicts where rates *go* but not where future *surprises* go, consistent with forward-guidance content rather than mechanical staleness.
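The own-lag test behind this result can be sketched with a minimal Newey-West slope statistic (Bartlett kernel, no small-sample correction). The data below are simulated white noise, and the function and variable names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nw_slope_tstat(x, y, bandwidth):
    """Slope and Newey-West HAC t-statistic for y = a + b*x + e, with
    Bartlett-kernel weights up to `bandwidth` (no df correction)."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    Xe = X * e[:, None]
    S = Xe.T @ Xe
    for l in range(1, bandwidth + 1):
        w = 1 - l / (bandwidth + 1)
        G = Xe[l:].T @ Xe[:-l]
        S += w * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    V = XtX_inv @ S @ XtX_inv
    return b[1], b[1] / np.sqrt(V[1, 1])

# White-noise surprises: no own-lag predictability (hypothetical data).
rng = np.random.default_rng(1)
s = rng.normal(size=250)
slope_wn, t_wn = nw_slope_tstat(s[:-1], s[1:], bandwidth=2)
# Sanity check on a construction with a known slope of 0.5.
y_known = 0.5 * s[:-1] + 0.05 * rng.normal(size=249)
slope_known, t_known = nw_slope_tstat(s[:-1], y_known, bandwidth=2)
```

A serially uncorrelated surprise delivers an own-lag slope near zero, while the known-slope check confirms the estimator itself is not degenerate; the paper's Table runs the analogous regressions on the estimated surprises at each filtration stage.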

# Macroeconomic and Financial Transmission

## Impulse Responses to Monetary Policy Surprises

The narrative surprise produces a coherent macroeconomic and financial transmission pattern. Contractionary surprises generate persistent disinflation without the price puzzle, sustained output and industrial-production contraction, and a delayed rise in unemployment, with signs consistent across all variables and no *ex post* cleaning required. The yield curve responds in two phases: an initial compression driven by rising expected short rates, then recovery and steepening past zero as the cycle is digested.

I examine the dynamic effects of narrative surprises on macroeconomic and financial variables using two-stage local projections with instrumental variables (2SLP-IV; Jordà, 2005). The first stage instruments the federal funds rate with the narrative surprise:
``` math
\begin{equation}
    \text{FFR}_t = \alpha^{(1)} + \pi \cdot \hat{s}_t + \sum_{j=1}^{L} \gamma_{j}^{(1)} \text{FFR}_{t-j} + \sum_{k=1}^{L} \boldsymbol{\delta}_{k}^{(1)\top} \mathbf{X}_{t-k} + u_t.
\end{equation}
```
The second stage regresses each outcome on the fitted policy rate at horizon $`h`$:
``` math
\begin{equation}
    y_{t+h} = \alpha_h^{(2)} + \beta_h \cdot \widehat{\text{FFR}}_t + \sum_{j=1}^{L} \gamma_{h,j}^{(2)} y_{t-j} + \sum_{k=1}^{L} \boldsymbol{\delta}_{h,k}^{(2)\top} \mathbf{X}_{t-k} + \varepsilon_{t+h},
\end{equation}
```
where $`y_{t+h}`$ is the outcome at horizon $`h`$, $`\hat{s}_t`$ is the narrative surprise (instrument), and $`\mathbf{X}_{t-k}`$ includes macro controls. Identification requires $`\mathbb{E}[\varepsilon_{t+h} \mid \hat{s}_t, \mathbf{X}_{t-k}, \text{FFR}_{t-j}] = 0`$: the narrative surprise is exogenous to future macroeconomic innovations conditional on pre-meeting controls (the substantive justification is in Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a>). Shock lags are set to zero, consistent with the AD/R&R narrative-instrument convention, and supported by the serial-correlation tests on $`\hat{s}_t`$ (Table <a href="#tab:forward_own_surprises" data-reference-type="ref" data-reference="tab:forward_own_surprises">[tab:forward_own_surprises]</a>, which finds no significant own-lag predictability at any horizon); robustness with $`\hat{s}_{t-1}, \hat{s}_{t-2}`$ as included exogenous controls leaves the IRFs qualitatively unchanged (Appendix <a href="#subsec:lp_spec_robustness" data-reference-type="ref" data-reference="subsec:lp_spec_robustness">11.1.2</a>). Standard errors are Newey-West HAC with bandwidth $`h+1`$ (Jordà, 2005), and all responses are normalised to a $`25`$ bp narrative surprise (Jordà & Taylor, 2025; Ramey, 2016).[^20]

The first stage includes 4 lags of the federal funds rate plus 4 lags of five macro controls (unemployment, log PCE Price Index, log industrial production, S&P 500, excess bond premium); the second stage includes 4 lags of the outcome and the same controls, with lagged FFR excluded. Macro outcomes use $`L=4`$ lags following Aruoba & Drechsel (2024); financial variables use $`L=2`$. The baseline asymmetric-controls levels specification departs from the symmetric LP-IV in Aruoba & Drechsel (2024): the federal funds rate is highly persistent ($`\hat{\rho} \approx 0.99`$) and the ZLB-excluded sample is small, a combination that Jordà & Taylor (2025, §3) show induces a small-sample bias of order $`O(T^{-1})`$ in levels LPs and makes the symmetric counterpart numerically fragile here. The long-difference variant of Jordà & Taylor (2025) replicates the headline magnitudes to within rounding, and the reduced-form LP on $`\hat{s}_t`$ recovers the same qualitative shapes (Appendices <a href="#subsec:jorda2025_robustness" data-reference-type="ref" data-reference="subsec:jorda2025_robustness">11.1.2.1</a>, <a href="#subsec:lp_spec_robustness" data-reference-type="ref" data-reference="subsec:lp_spec_robustness">11.1.2</a>).
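The two-stage structure can be sketched as follows. This is a bare-bones numpy version run on a simulated contractionary system; it omits the macro controls, HAC bands, normalisation, and ZLB handling of the actual specification, and every parameter and name is a hypothetical illustration.

```python
import numpy as np

def lp_iv_irf(y, ffr, s, horizons, lags=4):
    """Two-stage LP-IV sketch. Stage 1: regress FFR on the narrative
    surprise plus its own lags and keep the fitted value. Stage 2: one
    regression per horizon h of y[t+h] on the fitted FFR plus y lags."""
    T, t0 = len(y), lags
    X1 = np.column_stack(
        [np.ones(T - t0), s[t0:]] + [ffr[t0 - j : T - j] for j in range(1, lags + 1)]
    )
    ffr_hat = X1 @ np.linalg.lstsq(X1, ffr[t0:], rcond=None)[0]
    irf = []
    for h in horizons:
        idx = np.arange(t0, T - h)
        X2 = np.column_stack(
            [np.ones(len(idx)), ffr_hat[idx - t0]]
            + [y[idx - j] for j in range(1, lags + 1)]
        )
        irf.append(np.linalg.lstsq(X2, y[idx + h], rcond=None)[0][1])
    return np.array(irf)

# Simulated system: a 25 bp-scale surprise moves the policy rate, which
# drags activity down with a lag (all parameters hypothetical).
rng = np.random.default_rng(2)
T = 400
s = 0.25 * rng.normal(size=T)
ffr, y = np.zeros(T), np.zeros(T)
for t in range(1, T):
    ffr[t] = 0.8 * ffr[t - 1] + s[t]
    y[t] = 0.7 * y[t - 1] - 0.5 * ffr[t - 1] + 0.05 * rng.normal()
irf = lp_iv_irf(y, ffr, s, horizons=range(13))
```

In this toy system the activity response is negative over the early horizons, the qualitative shape the baseline IRFs display for output and prices; the paper's estimates add the full control set and Newey-West bands at each horizon.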

The narrative surprise covers 272 FOMC meetings over 1996–2026; the IRFs that follow are estimated on the meeting-month subset surviving the ZLB exclusion. The communication-bundle scope is shared with Gertler & Karadi (2015), who instrument the two-year yield with medium-horizon futures, though the two strategies differ on identification (GK exploits announcement-window timing; the present instrument relies on the temporal cutoff between document release and the policy decision, Appendix <a href="#appendix:pipeline-validation" data-reference-type="ref" data-reference="appendix:pipeline-validation">9</a>); cleaned high-frequency shocks (Bauer & Swanson, 2023a; Miranda-Agrippino & Ricco, 2021) instead target a narrower object by purging information and rule-revision components. Pre-crisis comparisons with these benchmarks and with narrative methods (Aruoba & Drechsel, 2024; Romer & Romer, 2004) on the common 1996:01–2008:10 window confirm directional consistency (Appendix <a href="#subsec:precrisis_comparisons" data-reference-type="ref" data-reference="subsec:precrisis_comparisons">11.1.5</a>).

Figure <a href="#fig:main_irf_panel" data-reference-type="ref" data-reference="fig:main_irf_panel">9</a> presents the baseline 2SLP-IV results, instrumenting the federal funds rate with the narrative surprise and excluding meetings in the conventional ZLB episode (2008M12–2015M12), where the federal funds rate is mechanically pinned and provides essentially no identifying variation as the instrumented variable. COVID-era meetings (2020M03–2022M03) are retained because they contain genuine FFR variation (the 50bp and 100bp emergency cuts of March 2020 and the March 2022 liftoff sequence) that the FFR-instrumenting strategy can exploit. Including the conventional-ZLB sample attenuates the FFR-instrumented macro responses substantially without flipping signs, reflecting the FFR’s pinned variation rather than a methodological failure; Appendix <a href="#subsec:zlb_robustness" data-reference-type="ref" data-reference="subsec:zlb_robustness">11.1.3</a> reports the full-sample variant alongside a Gertler–Karadi-style specification that instruments the two-year Treasury yield rather than the policy rate. The 2Y-yield instrument is invariant to ZLB inclusion (first-stage $`F = 47.53`$ full sample vs $`F = 28.41`$ ZLB-excluded; first-stage coefficient stable to within one percent) and is the methodologically appropriate full-sample object; the FFR-instrumented headline corresponds to the conventional-policy estimand.

<figure id="fig:main_irf_panel" data-latex-placement="t!">
<figcaption><em>Note:</em> Impulse response functions to a 25 bp document-conditioned narrative surprise. Two-stage local projections (2SLP-IV) with federal funds rate instrumented by the narrative surprise. Specification: 4 lags, 0 shock lags. First-stage controls: federal funds rate, unemployment, log PCE, log IP, SP500, EBP (4 lags each). Second-stage controls: unemployment, log PCE, log IP, SP500, EBP (lagged FFR excluded from second stage). Newey-West HAC standard errors with bandwidth <span class="math inline"><em>h</em> + 1</span> <span class="citation" data-cites="Jorda2005">(Jordà, 2005)</span>. Solid line: 3-period moving average of the raw point estimates as a visual smoother (<span class="citation" data-cites="Jorda2025">(Jordà &amp; Taylor, 2025)</span>, §5; bands are unaffected). Shaded areas: 68% (inner, darker) and 90% (outer, lighter) pointwise HAC confidence bands of the unsmoothed coefficients. Sample: 171 meeting-month observations (1996–2025), ZLB excluded, COVID included.</figcaption>
</figure>

Following a 25bp contractionary surprise, the fitted federal funds rate rises 0.25pp on impact by construction and mean-reverts. Macroeconomic variables respond significantly only after month five. The PCE price index turns negative within months and reaches $`-0.7`$ to $`-0.9\%`$ by months 10–15, avoiding the “price puzzle” common in market-based identifications (Bauer & Swanson, 2023a; Miranda-Agrippino & Ricco, 2021). Real GDP declines $`-1.5`$ to $`-1.9\%`$ by months 10–15, industrial production $`-0.7`$ to $`-1.1\%`$, and unemployment peaks around $`+0.24`$pp by month 15.

The identification evidence in Section <a href="#sec:identification" data-reference-type="ref" data-reference="sec:identification">5.4</a> is consistent with this reading: the narrative surprise is not rejected against the joint $`(\beta_{\text{MP}}, \beta_{\text{CBI}}) = (1, 0)`$ benchmark, though with wide standard errors and a non-zero CBI point estimate that the data do not pin down precisely, so a CBI admixture cannot be ruled out from the J&K projection alone; the Bauer & Swanson (2023b) purging exercise leaves the responses unchanged (Appendix <a href="#subsec:purged_surprise_robustness" data-reference-type="ref" data-reference="subsec:purged_surprise_robustness">11.1.4</a>). Appendix <a href="#subsec:zlb_robustness" data-reference-type="ref" data-reference="subsec:zlb_robustness">11.1.3</a> reports robustness to instrumenting the two-year Treasury yield in the spirit of Gertler & Karadi (2015) and to including the ZLB period.

<figure id="fig:financial_panel" data-latex-placement="t!">
<figcaption><em>Note:</em> Impulse response functions to a 25 bp document-conditioned narrative surprise. Two-stage local projections (2SLP-IV) with federal funds rate instrumented by the narrative surprise. Rows show term spreads, expected policy paths, and term premia for 5-year and 10-year maturities. Term spreads: <span class="math inline"><em>n</em></span>-year yield minus 1-month Treasury bill rate. Expected paths: yield minus term premium from <span class="citation" data-cites="FaveroFernandez-Fuertes2025">Favero &amp; Fernández-Fuertes (2025)</span> decomposition. Specification: 2 lags for outcomes and controls (financial variables), 0 shock lags. Controls: same as Figure <a href="#fig:main_irf_panel" data-reference-type="ref" data-reference="fig:main_irf_panel">9</a>. Newey-West HAC standard errors with bandwidth <span class="math inline"><em>h</em> + 1</span> <span class="citation" data-cites="Jorda2005">(Jordà, 2005)</span>. Solid line: 3-period moving average of the raw point estimates as a visual smoother (<span class="citation" data-cites="Jorda2025">(Jordà &amp; Taylor, 2025)</span>, §5; bands are unaffected). Shaded areas: 68% (inner, darker) and 90% (outer, lighter) pointwise HAC confidence bands of the unsmoothed coefficients. Sample: 171 meeting-month observations (1996–2025), ZLB excluded, COVID included.</figcaption>
</figure>

Turning to financial transmission, I decompose yield-curve spreads into expected short rates and term premia:
``` math
\begin{equation}
    r^{(n)}_t - r^{(1)}_t = \sum_{i=1}^{n-1} \left(1-\frac{i}{n}\right)\mathbb{E}_t\Delta r^{(1)}_{t+i} + \theta^{(n)}_t,
\end{equation}
```
where $`n`$ is maturity expressed in months ($`n=60`$ for the 5-year and $`n=120`$ for the 10-year yield), $`r^{(1)}_t`$ is the 1-month yield, $`\mathbb{E}_t\Delta r^{(1)}_{t+i}`$ is the expected one-month change in the 1-month yield $`i`$ months ahead, and $`\theta^{(n)}_t`$ is the term premium. The attribution between components is model-dependent; I use the data-congruent specification of Favero & Fernández-Fuertes (2025), which imposes stationarity of term premia, a property standard affine models (e.g., Adrian et al., 2013) do not enforce. Figure <a href="#fig:financial_panel" data-reference-type="ref" data-reference="fig:financial_panel">10</a> presents the complete decomposition for 5- and 10-year maturities.
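A worked numerical instance of the identity makes the weighting concrete. The expected-rate path below is hypothetical (2 bp per month of expected increases for a year, flat thereafter), chosen only to show how the $(1 - i/n)$ weights discount far-horizon expectations and how the term premium is backed out from an observed spread.

```python
import numpy as np

def expectations_component(exp_changes, n):
    """Weighted sum sum_{i=1}^{n-1} (1 - i/n) * E_t[Delta r^(1)_{t+i}]
    from the spread decomposition; exp_changes[i-1] is the expected
    one-month change i months ahead (in percentage points)."""
    i = np.arange(1, n)
    return float(np.sum((1 - i / n) * np.asarray(exp_changes)[: n - 1]))

# Hypothetical path at the 5-year (n = 60) maturity: +0.02 pp expected
# each month for 12 months, then flat.
n = 60
path = np.concatenate([np.full(12, 0.02), np.zeros(n - 1 - 12)])
exp_comp = expectations_component(path, n)   # 0.02 * (12 - 78/60) = 0.214 pp
term_premium = 0.30 - exp_comp               # back theta out of a 0.30 pp spread
```

With a flat expected path the spread equals the term premium, the pure-expectations benchmark; the attribution in the paper instead comes from the Favero & Fernández-Fuertes (2025) model rather than an assumed path.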

Term spreads compress 0.4–0.5pp by month 5 as the front end rises, then recover and steepen past zero around month 13–15. Expected policy paths drive the initial compression: both 5Y and 10Y paths drop to about $`-0.4`$pp on impact and deepen to roughly $`-0.6`$pp by month 5 before recovering as the FFR response mean-reverts. Term premia trace a small impact uptick, a mid-horizon hump of roughly $`+0.17`$pp by month 15, and a persistent decline to $`-0.3`$ to $`-0.5`$pp by year three; spread recovery from month 15 onward is therefore driven primarily by the expected path normalising rather than by sustained increases in term premia.

The IRFs scale a communication-rich event into rate units rather than identifying the effect of an isolated rate change. The next subsection turns to asset-pricing evidence and asks whether the signal carries content that announcement-window derivatives do not already span.

## Economic Validation Through Yield Curve Trading

This subsection provides asset-pricing validation that the document-conditioned residual $`\hat{s}_t`$ carries directional, policy-path-relevant information not spanned by the standard announcement-window basis, primarily in active rate-cycle episodes where the Fed signals state-contingent decisions; it is neither a tradable alpha nor an identified wedge. The span test below identifies a linear-orthogonal share — an *upper bound* on a documentary-vs-market wedge, since the residual $`\hat{u}_t`$ also absorbs LLM extraction noise and finite-basis approximation error. The investigation proceeds in two steps: a span projection and a yield-curve local projection localise where the orthogonal content appears across maturities, motivating the portfolio design that follows.[^21]

To gauge how much of the LLM surprise is new relative to standard high-frequency measures, I project $`\hat{s}_t^{\text{LLM}}`$ onto the four Kuttner (2001)–Gürkaynak et al. (2005) announcement-window surprises:
``` math
\begin{equation}
\hat{s}_t^{\text{LLM}} = \boldsymbol{\beta}^\top \mathbf{f}_t + u_t,
\qquad \mathbf{f}_t = (\text{FF1}_t,\, \text{FF4}_t,\, \text{ED1}_t,\, \text{ED4}_t)^\top.
\label{eq:jk_span_projection}
\end{equation}
```
The spanning null $`\mathrm{Var}(u_t) = 0`$ holds if and only if $`\hat{s}_t^{\text{LLM}} \in \mathrm{span}(\mathbf{f}_t)`$; the sample analogue of the unspanned variance share is $`1 - \widehat{R}^2`$. On 217 meetings (1996–2024), the four factors are jointly significant yet explain only $`18.5\%`$ of the variance: $`81.5\%`$ of the narrative surprise lies outside the linear span of standard announcement-window derivatives.[^22]
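The spanning statistic reduces to a single projection. The sketch below builds a hypothetical analogue of the (FF1, FF4, ED1, ED4) regression: a signal with a small spanned component and a large orthogonal one, so the unspanned share is known by construction; none of the numbers are the paper's.

```python
import numpy as np

def orthogonal_share(s, F):
    """1 - R^2 from regressing s on the columns of F plus a constant:
    the share of Var(s) outside the linear span of F."""
    X = np.column_stack([np.ones(len(s)), F])
    resid = s - X @ np.linalg.lstsq(X, s, rcond=None)[0]
    return resid.var() / s.var()

# Hypothetical four-factor basis and a mostly-orthogonal signal.
rng = np.random.default_rng(3)
F = rng.normal(size=(500, 4))
spanned = F @ np.array([0.4, 0.1, 0.0, 0.0])
s = spanned + 0.85 * rng.normal(size=500)
share = orthogonal_share(s, F)   # close to 0.7225 / 0.8925, about 0.81
```

A fully spanned signal returns a share of essentially zero, which is the numerical content of the null; the residual from this projection is the $`\hat{u}_t`$ used later in the orthogonal-signal portfolio.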

To trace where that orthogonal content propagates across maturities, I estimate
``` math
\begin{equation}
\Delta y_{n,t+h} = \alpha_h + \beta_h \hat{s}_t + \gamma_h \, \text{target}_t + \delta_h \, \text{path}_t + \text{controls} + \varepsilon_{t+h},
\label{eq:yield_lp}
\end{equation}
```
where $`\Delta y_{n,t+h}`$ is the cumulative yield change from the previous meeting to $`h`$ *meetings* ahead and $`\text{target}_t`$, $`\text{path}_t`$ are the standardised Gürkaynak et al. (2005) announcement-window factors. Conditioning on these is the identifying step: $`\beta_h`$ measures the response to the component of $`\hat{s}_t`$ orthogonal to announcement-window pricing. Figure <a href="#fig:yield_lp" data-reference-type="ref" data-reference="fig:yield_lp">11</a> shows a sign change across the curve: the 1-month yield peaks above 30 bp around $`h{=}4`$, the 2-year crosses zero by $`h{=}4`$ and bottoms near $`-16`$ bp at $`h{=}6`$, and the 10-year is flat throughout. A 1m/2y flattener captures both sides of that sign change; cross-pair variants are in Appendix Table <a href="#tab:cross_strategy_robustness" data-reference-type="ref" data-reference="tab:cross_strategy_robustness">[tab:cross_strategy_robustness]</a>.

<figure id="fig:yield_lp" data-latex-placement="t!">
<figcaption><em>Note:</em> Local projection estimates of equation <a href="#eq:yield_lp" data-reference-type="eqref" data-reference="eq:yield_lp">[eq:yield_lp]</a>. Dependent variable: cumulative yield change at maturity <span class="math inline"><em>n</em></span> from the previous meeting’s yield to <span class="math inline"><em>h</em></span> <em>meetings</em> ahead (in basis points; horizon axis is therefore in meetings, not months). Shock: narrative surprise <span class="math inline"><em>ŝ</em><sub><em>t</em></sub></span>, normalized to 25 bp. Controls: 4 lags of the outcome variable and 4 lags of six macro controls (federal funds rate, unemployment rate, log PCE Price Index, log industrial production, log S&amp;P 500, excess bond premium). HAC standard errors with bandwidth <span class="math inline"><em>h</em> + 1</span> <span class="citation" data-cites="Jorda2005">(Jordà, 2005)</span>. Solid line: 3-period moving average of the raw point estimates as a visual smoother (<span class="citation" data-cites="Jorda2025">(Jordà &amp; Taylor, 2025)</span>, §5; bands are unaffected). Shaded bands: 68% (inner, darker) and 90% (outer, lighter) pointwise HAC confidence intervals. Sample: 136 FOMC meetings (1999–2024), ZLB excluded. Yields: Gürkaynak–Sack–Wright (2007) zero-coupon curve.</figcaption>
</figure>

The cross-maturity sign change<span id="para:portfolio" label="para:portfolio"></span> maps into an event-study portfolio that operationalises a directional test of $`\hat{s}_t`$. The position opens at meeting $`t`$ when $`|\hat{s}_t|`$ exceeds its expanding-window two-thirds quantile (computed strictly on $`\tau < t`$): a positive surprise opens an equal-notional 1m/2y flattener, a negative surprise the mirror-image steepener; entry is the business day after the announcement, exit at least $`H = 180`$ calendar days later. Per-trade returns are in yield-spread basis points, gross of frictions; the exercise is a diagnostic of directional information, not a tradable strategy.[^23]
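The entry rule can be written out directly; the key discipline is that the quantile threshold at meeting $t$ uses only strictly earlier meetings. The function below is an illustrative sketch of that trigger, not the paper's code, and the toy surprise series is hypothetical.

```python
import numpy as np

def trade_signals(s, q=2 / 3):
    """Expanding-window trigger: at meeting t, trade iff |s[t]| exceeds
    the q-quantile of |s[tau]| over tau < t. Sign +1 opens the 1m/2y
    flattener (hawkish surprise), -1 the mirror steepener (dovish)."""
    signals = []
    for t in range(len(s)):
        if t == 0:                       # no history yet, no trade
            signals.append(0)
            continue
        thresh = np.quantile(np.abs(s[:t]), q)
        signals.append(int(np.sign(s[t])) if abs(s[t]) > thresh else 0)
    return np.array(signals)

sig = trade_signals(np.array([0.10, 0.05, 0.30, -0.40, 0.02]))
```

On the toy series the large positive surprise opens a flattener, the large negative one a steepener, and the small surprises are filtered out; entry a business day after the announcement and the 180-day exit are handled outside this trigger.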

Table <a href="#tab:factor_decomposition" data-reference-type="ref" data-reference="tab:factor_decomposition">[tab:factor_decomposition]</a> evaluates the curve response at $`H = 180`$ days on the eligible sample of 254 FOMC meetings with valid 180-day entry and exit yields (1996–2024); the top-tercile filter on $`|\hat{s}_t|`$ selects 70 events. On high-signal meetings the front-end curve moves in the direction predicted by $`\hat{s}_t`$ at $`62.9\%`$ of meetings, with a mean response of $`+11.4`$ bp per meeting, significant at the 5% level under the bootstrap. The unconditional and matched-$`N`$ random-sign benchmarks average $`-1.46`$ bp and $`+0.76`$ bp respectively, and Panel B confirms that no entry-time term-structure factor absorbs the response ($`R^2 = 0.027`$).

The cleanest test of the not-spanned claim runs the same portfolio on the orthogonal residual $`\hat{u}_t`$ from <a href="#eq:jk_span_projection" data-reference-type="eqref" data-reference="eq:jk_span_projection">[eq:jk_span_projection]</a>, on the restricted sample where that residual is defined. Table <a href="#tab:trade_variants_main" data-reference-type="ref" data-reference="tab:trade_variants_main">[tab:trade_variants_main]</a> reports it: the raw LLM signal yields $`+15.3`$ bp per meeting and the orthogonal residual a statistically indistinguishable $`+13.6`$ bp, with duration-neutral weighting lifting the response to $`+34.0`$ bp at the 1% level. Because $`\hat{u}_t`$ is by construction orthogonal to (FF1, FF4, ED1, ED4), this is the direct evidence that the payoff is not a re-projection of the linear high-frequency span. The result is stable across holding windows from $`H = 90`$ to $`H = 360`$ days, with point estimates rising from $`+10.4`$ to $`+16.4`$ bp; cross-pair robustness is in Appendix Table <a href="#tab:cross_strategy_robustness" data-reference-type="ref" data-reference="tab:cross_strategy_robustness">[tab:cross_strategy_robustness]</a> and a horizon sweep against ED4/ED1 in Appendix <a href="#subsec:horizon_sweep_appendix" data-reference-type="ref" data-reference="subsec:horizon_sweep_appendix">[subsec:horizon_sweep_appendix]</a>.

Figure <a href="#fig:horizon_sweep_cum" data-reference-type="ref" data-reference="fig:horizon_sweep_cum">12</a> traces cumulative returns by holding horizon for the LLM signal alongside ED4 and ED1: the LLM-signed flattener accumulates returns earlier in the holding period (roughly $`8\%`$ at three months versus $`2\%`$ for ED4), with the gap closing only after twelve months; the paired-meeting LLM–ED4 difference is positive in point estimate at every horizon, but the short-horizon gap is not statistically significant even before any multiple-testing correction on the small paired sample ($`N_h \in [21, 25]`$), and the only horizon clearing raw $`5\%`$ does not survive Holm adjustment (Appendix <a href="#subsec:horizon_sweep_appendix" data-reference-type="ref" data-reference="subsec:horizon_sweep_appendix">[subsec:horizon_sweep_appendix]</a>).

<figure id="fig:horizon_sweep_cum" data-latex-placement="t">
<embed src="figures/persistence/v30.5/deepseek-v3.1:671b-cloud/horizon_sweep_cum_nobands_light.pdf" style="width:95.0%" />
<figcaption><em>Note:</em> Geometric cumulative returns from compounding all top-tercile events on the 1m/2y equal-notional flattener, by holding period in months, applied separately to the LLM surprise, ED4, and ED1 signals on the common HF <span class="math inline">∩</span> LLM sample (1996–2024). The figure displays only the point estimates: the relevant inference object is the <em>paired</em> difference between signals at short horizons (LLM minus ED4 at <span class="math inline"><em>h</em> = 3, 6</span> months), not the marginal sampling uncertainty of each signal at each horizon. Per-trade significance at the baseline 180-day horizon is reported in Table <a href="#tab:factor_decomposition" data-reference-type="ref" data-reference="tab:factor_decomposition">[tab:factor_decomposition]</a>.</figcaption>
</figure>

The result is not specific to one extraction model: at a common prompt version v30.2 with all six LLMs run on the full sample, the four frontier-class models (DeepSeek-V3.1, GPT-5-mini, Qwen-3.6, GPT-4.1-mini) all deliver Sharpe ratios in $`[0.33, 0.47]`$, significant at the $`5\%`$ level (Appendix <a href="#subsec:trading_cross_model" data-reference-type="ref" data-reference="subsec:trading_cross_model">[subsec:trading_cross_model]</a>); the two smaller models are directionally consistent but statistically insignificant, suggesting that extraction quality, not idiosyncratic model choice, is the binding constraint on shock quality.

The baseline result is asymmetric across the sign of the surprise, and that asymmetry is the central scope condition of the trading evidence (Table <a href="#tab:breakdowns" data-reference-type="ref" data-reference="tab:breakdowns">1</a>, Panel B). Dovish surprises ($`s_t < 0`$, opening a steepener) earn $`+16.1`$ bp per trade at the 1% level, with a Sharpe of $`0.61`$ and a $`75.7\%`$ hit rate; hawkish surprises ($`s_t > 0`$, opening a flattener) earn only $`+6.1`$ bp with a Sharpe of $`0.18`$ and a $`48.5\%`$ hit rate, statistically indistinguishable from random sign. The directional skill the baseline reports is therefore a dovish-side effect; the hawkish leg, after which the strategy is named, does not carry the result. The result is also regime-dependent (Panel A). Excluding the 2022–2024 tightening cycle (entry date $`<`$ 2022-01-01) drops the per-trade return from $`+11.4`$ bp to $`+5.9`$ bp, the Sharpe from $`0.38`$ to $`0.22`$, and the result is no longer statistically significant. The natural reading is that the baseline is driven by the recent cycle; a sample-composition reading qualifies it. Roughly a third of the pre-2022 window is at the zero lower bound (2008-12 through 2015-12 and 2020-03 through 2021-12), when the front of the curve is mechanically pinned and a 1m/2y flattener has little to capture; the strategy’s economic content is directional skill on an *active* rate cycle, and the pre-2022 subsample is by composition the period with the least cycle to read.
Panel C reports the disagreement-subsample economic test against ED4: the LLM-direction magnitude-weighted return is $`+8.5`$ bp, significant at the 5% level; the full sign-disagreement decomposition, unweighted paired-direction diagnostics, and joint horse race are in Subsections <a href="#subsec:persistence" data-reference-type="ref" data-reference="subsec:persistence">[subsec:persistence]</a> and <a href="#subsec:trading_horse_race" data-reference-type="ref" data-reference="subsec:trading_horse_race">[subsec:trading_horse_race]</a>.

<div id="tab:breakdowns">

<table>
<caption>Subsample and Asymmetry Breakdowns for the Baseline Strategy</caption>
<thead>
<tr>
<th style="text-align: left;">Subsample / split</th>
<th style="text-align: left;"><span class="math inline"><em>N</em></span></th>
<th style="text-align: left;">Per-trade (bp)</th>
<th style="text-align: left;">Sharpe</th>
<th style="text-align: left;">Hit (%)</th>
<th style="text-align: left;"><span class="math inline"><em>p</em></span></th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="6" style="text-align: left;"><em>Panel A: subsample stability</em></td>
</tr>
<tr>
<td style="text-align: left;">Baseline (full sample)</td>
<td style="text-align: left;"></td>
<td style="text-align: left;">+11.37</td>
<td style="text-align: left;">+0.38</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><span class="math inline"><sup>**</sup></span></td>
</tr>
<tr>
<td style="text-align: left;">Ex-2022 (entry &lt; 2022-01-01)</td>
<td style="text-align: left;"></td>
<td style="text-align: left;">+5.88</td>
<td style="text-align: left;">+0.22</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
</tr>
<tr>
<td colspan="6" style="text-align: left;"><em>Panel B: hawkish vs dovish asymmetry</em></td>
</tr>
<tr>
<td style="text-align: left;">Hawkish surprises (s&gt;0, flattener)</td>
<td style="text-align: left;"></td>
<td style="text-align: left;">+6.07</td>
<td style="text-align: left;">+0.18</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"></td>
</tr>
<tr>
<td style="text-align: left;">Dovish surprises (s&lt;0, steepener)</td>
<td style="text-align: left;"></td>
<td style="text-align: left;">+16.10</td>
<td style="text-align: left;">+0.61</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><span class="math inline"><sup>***</sup></span></td>
</tr>
<tr>
<td colspan="6" style="text-align: left;"><em>Panel C: disagreement subsample (LLM vs ED4)</em></td>
</tr>
<tr>
<td style="text-align: left;">LLM direction, magnitude-weighted</td>
<td style="text-align: left;"></td>
<td style="text-align: left;">+8.49</td>
<td style="text-align: left;">+0.40</td>
<td style="text-align: left;"></td>
<td style="text-align: left;"><span class="math inline"><sup>**</sup></span></td>
</tr>
</tbody>
</table>

</div>

<div class="minipage">

*Note:* Per-meeting unsigned 1m/2y flattener payoff over a 180-day hold (matching Table <a href="#tab:factor_decomposition" data-reference-type="ref" data-reference="tab:factor_decomposition">[tab:factor_decomposition]</a>). Panel A varies the sample window: “Ex-2022” restricts entries to before 2022-01-01, isolating the result from the $`2022`$–$`2024`$ tightening cycle. Panel B splits the baseline sample by the sign of the original surprise: hawkish events ($`s_t > 0`$) open as flatteners, dovish events ($`s_t < 0`$) as steepeners. Panel C reports the $`76`$-meeting disagreement subsample where $`\mathrm{sign}(\hat{s}_t^{\text{LLM}}) \neq \mathrm{sign}(\text{ED4}_t)`$, signed by the LLM direction and weighted by realised payoff; the ED4 row is its exact mirror by construction. Full sign-disagreement decomposition (Table <a href="#tab:sign_disagreement" data-reference-type="ref" data-reference="tab:sign_disagreement">[tab:sign_disagreement]</a>) and unweighted paired-direction diagnostics in Subsection <a href="#subsec:persistence" data-reference-type="ref" data-reference="subsec:persistence">[subsec:persistence]</a>; the formal information-distinctness test (joint horse race) is in Subsection <a href="#subsec:trading_horse_race" data-reference-type="ref" data-reference="subsec:trading_horse_race">[subsec:trading_horse_race]</a>. $`p`$-values from a stationary block bootstrap (Politis & Romano, 1994; $`5{,}000`$ resamples, $`L = 4`$ events). $`^{*}p<0.10`$, $`^{**}p<0.05`$, $`^{***}p<0.01`$.

</div>
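The stationary-block-bootstrap inference used for the per-trade $p$-values can be sketched as follows: blocks restart with probability $1/L$ and indices wrap circularly, following Politis & Romano (1994). The payoff numbers, seed, and reduced resample count here are hypothetical; the paper uses $5{,}000$ resamples with mean block length $L=4$.

```python
import numpy as np

def stationary_bootstrap_pvalue(x, n_boot=1000, mean_block=4, seed=0):
    """Two-sided p-value for H0: E[x] = 0. Resampled means are centred
    at the sample mean, so exceedances of |mean| give the p-value."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n, obs, p = len(x), x.mean(), 1.0 / mean_block
    exceed = 0
    for _ in range(n_boot):
        idx = np.empty(n, dtype=int)
        t = int(rng.integers(n))
        for i in range(n):
            idx[i] = t
            # restart a block with prob p; otherwise continue, wrapping
            t = int(rng.integers(n)) if rng.random() < p else (t + 1) % n
        if abs(x[idx].mean() - obs) >= abs(obs):
            exceed += 1
    return exceed / n_boot

# Hypothetical per-trade payoffs with a clearly positive mean.
rng = np.random.default_rng(4)
payoffs = 11.0 + 5.0 * rng.normal(size=70)
pval = stationary_bootstrap_pvalue(payoffs, n_boot=500, mean_block=4)
```

The geometric block lengths preserve short-range dependence in the event sequence, which a plain i.i.d. bootstrap would destroy; that is what makes it the appropriate reference distribution for overlapping-hold trade returns.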

Figure <a href="#fig:trading_pnl" data-reference-type="ref" data-reference="fig:trading_pnl">13</a> visualises the regime breakdown: the $`2022`$–$`2024`$ cycle accounts for $`57\%`$ of cumulative response, the pre-crisis tightening (1996–2007) contributes $`24\%`$, the COVID window $`18\%`$, and the post-2015 normalisation is roughly flat. Three cycles contribute positively; only the quiet post-2015 period is a net drag.

Two robustness checks rule out alternative readings of the residual (Appendix Table <a href="#tab:factor_exposure" data-reference-type="ref" data-reference="tab:factor_exposure">9</a>, Panels A–B): the per-trade return is uncorrelated with entry-month macro variables ($`R^2 = 0.15`$, no factor significant at $`5\%`$), ruling out a time-varying risk-premium interpretation, and with entry-time fixed-income factors (level, slope, curvature, carry, momentum; $`R^2 = 0.03`$), ruling out a disguised generic exposure. The substantive decomposition (Panel C) projects $`\hat{s}_t`$ on the Gürkaynak et al. (2005) target and path factors directly (rather than the four-factor (FF1, FF4, ED1, ED4) basis used in equation <a href="#eq:jk_span_projection" data-reference-type="eqref" data-reference="eq:jk_span_projection">[eq:jk_span_projection]</a>) and yields a target-factor loading of $`\hat{\beta}_{\text{target}} = +0.047`$, significant at the 1% level, and a path-factor loading $`\hat{\beta}_{\text{path}} \approx 0`$ statistically indistinguishable from zero, with $`R^2 = 0.124`$. The LLM signal therefore correlates linearly with current-meeting target news but *not* with the linear announcement-window path factor; the $`88\%`$ orthogonal residual may reflect a non-linear function of the path factor, path-relevant content uncorrelated with the linear GSS basis, finite-basis approximation error, or LLM extraction noise. The diagnostics that follow probe *when* that orthogonal content is informative rather than attribute specific language to it: $`\hat{s}_t`$ is by construction the announcement information *not* anticipated by the pre-meeting documents, so any text feature defined on those documents is most naturally read as a moderator of the residual, not as content the residual carries.

A complementary observation is that the surprise’s profitability concentrates in dovish episodes. Splitting the post-2008 sample into surprise terciles *within* each Fed communication regime (pre-FG, ZLB/FG, post-liftoff, COVID/ZLB, $`2022`$+) and counting conditional constructions (‘if’, ‘until’, ‘should’, ‘provided’, ‘unless’, and the modal verbs ‘would’, ‘could’, ‘might’) over the prior statement and prior minutes shows that documents preceding dovish surprises use roughly $`1.5`$ to $`2`$ more conditional markers per $`1{,}000`$ words than documents preceding hawkish surprises; the gap holds in the same direction in $`20`$ of $`23`$ extraction runs spanning six LLM families, is significant at the $`5\%`$ level in $`13`$, and never significantly reverses (Appendix <a href="#subsec:text_asymmetry_multimodel" data-reference-type="ref" data-reference="subsec:text_asymmetry_multimodel">[subsec:text_asymmetry_multimodel]</a>). Dovish episodes are typically signalled through state-contingent easing paths (“rates will remain at the lower bound *until*…”); hawkish episodes lean on unconditional commitments. Asymmetric monetary transmission compounds this: easing surprises move the front of the curve more, and more persistently, than tightening surprises of equal size. But the $`75.7\%`$ vs. $`48.5\%`$ hit-rate gap is too wide to be explained by payoff convexity alone, so episode selection is doing real work.
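The conditional-marker density is simple to reproduce. A minimal sketch, assuming a lowercase bag-of-words tokenizer and the marker list quoted above (the paper does not specify its exact tokenisation; the function name is mine):

```python
import re

# Conditional constructions and modal verbs from the text diagnostic.
CONDITIONAL_MARKERS = {
    "if", "until", "should", "provided", "unless",
    "would", "could", "might",
}

def conditional_density(text: str) -> float:
    """Conditional markers per 1,000 words of a Fed document."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in CONDITIONAL_MARKERS)
    return 1000.0 * hits / len(words)
```

Applied to the prior statement and prior minutes of each meeting, the dovish-minus-hawkish gap in this density is the statistic reported above.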

<figure id="fig:trading_pnl" data-latex-placement="t!">
<embed src="figures/curve_trading/v30.5/deepseek-v3.1:671b-cloud/cumulative_returns_equal_weight_300d_light.pdf" style="width:95.0%" />
<figcaption><em>Note:</em> Cumulative front-end curve response on top-tercile high-signal meetings (Section <a href="#para:portfolio" data-reference-type="ref" data-reference="para:portfolio">[para:portfolio]</a>), signed by <span class="math inline"><em>ŝ</em><sub><em>t</em></sub></span> and compounded across meetings as a regime-stability check on the directional content of the signal. The visualization uses a <span class="math inline">300</span>-day window (rather than the baseline <span class="math inline">180</span>-day window) so that overlapping cycles are visually separable; the per-meeting statistics, placebo battery, and inference in Table <a href="#tab:factor_decomposition" data-reference-type="ref" data-reference="tab:factor_decomposition">[tab:factor_decomposition]</a> all use the <span class="math inline">180</span>-day window. Sample: 1996–2025 FOMC meetings, ZLB excluded. Reported in yield-spread basis points captured over the window (the per-meeting yield-spread change between the two legs).</figcaption>
</figure>

Two scope conditions delimit the result. First, the LLM signal and ED4 share the same rate channel ($`R^2 = 0.155`$ vs. $`0.154`$): the LLM advantage is cross-sectional directional content where the two disagree, not a new transmission mechanism. Second, the effect concentrates in active rate cycles; excluding the $`2022`$–$`2024`$ cycle, the per-trade return is no longer significant and the paired LLM–ED4 difference is statistically weak on the small paired sample (Appendix <a href="#sec:appendix_trading" data-reference-type="ref" data-reference="sec:appendix_trading">11.2</a>). The exercise is therefore measurement validation rather than an alpha claim: on this delimited subset of meetings, the document-conditioned residual carries directional content orthogonal to the linear announcement-window basis, which I read as forward-guidance information given where it loads on the curve. Accordingly, the trading evidence should not be read as establishing a symmetric or full-sample implementable strategy; it shows that, when Fed documents contain a large LLM-measured surprise, the component unexplained by standard high-frequency factors forecasts the relevant curve direction, with the strongest pricing content on the dovish leg in active rate-cycle episodes.

# Conclusion

Measuring monetary policy surprises requires first measuring expectations and then stating clearly which information set those expectations condition on. This paper extracts those expectations directly from the sequential release of central bank communications via multi-agent LLM synthesis on a known, time-stamped documentary record. The *New Hope* is methodological: as extraction methods become more sophisticated, identification quality improves at the source rather than through *ex post* adjustments. A four-stage filtration over public Fed communications produces a well-calibrated prior whose residual is a disciplined forecast error against an explicit conditioning set. Three exercises validate the measure. First, the surprise is moderately predictable from public non-document predictors, which quantifies the gap between the documentary record and a richer information set; a news-augmented stage ($`\mathcal{P}_5`$) closes that gap. Second, the measure produces theoretically coherent impulse responses with persistent contractionary transmission. Third, an asset-pricing test shows the residual carries directional, policy-path-relevant information orthogonal to the linear announcement-window basis, loading where forward-guidance theory predicts (the front of the curve), with channel diagnostics that scope to active rate-cycle episodes whose pre-meeting documents flag state-contingent policy decisions rather than to a sentiment-dictionary baseline. As an auxiliary result, splitting the surprise by documentary timing (an announcement-day residual and an expectation revision over the pre-FOMC blackout window) recovers a decomposition along the lines of Jarociński & Karadi (2020), separating information from monetary policy without relying on asset-price sign restrictions.

Methodologically, the paper delivers an auditable communication-based innovation with an explicit conditioning set: the surprise conditions on a known, time-stamped documentary record. This structure makes the remaining wedges between the documentary record, the full public information set, and private Fed forecasts interpretable, rather than hiding them inside announcement-window prices or *ex post* cleaning regressions.

The framework extends naturally beyond this proof of concept: central banks could incorporate internal forecasts alongside public communications, market participants could integrate real-time news flows arriving during blackout periods, and researchers could apply the architecture to other domains where institutional communications precede market-moving events, including earnings announcements, regulatory decisions, and geopolitical developments.

The results carry implications beyond monetary economics. For policymakers, the findings validate the Fed’s communication strategy: public documents explain over half of rate-decision variance, consistent with communication serving as a substantive policy instrument rather than a cosmetic one. For researchers, the framework offers a scalable alternative to both hand-coded narrative measures and cleaned high-frequency identification, while keeping the estimand explicit. The genuine policy surprises, what remains unpredictable from the public record, can now be measured in real time, opening new questions about monetary transmission and the role of conditional language in central bank communication.

# Agent Output Examples

I present representative outputs from the six-agent pipeline (v30.5, `deepseek-v3.1:671b-cloud`). The pipeline processes four document types in a fixed sequence for each FOMC meeting $`t`$: the Statement from meeting $`t{-}1`$, the Chair’s press conference from the same day (available from April 2011), the Minutes from meeting $`t{-}1`$ (released approximately three weeks later), and the Beige Book released approximately two weeks before meeting $`t`$. Each document feeds one extraction agent; the Expectation Engine then updates the probability distribution sequentially through stages $`\mathcal{P}_1 \to \mathcal{P}_4`$, with an optional $`\mathcal{P}_5`$ incorporating inter-meeting news. All examples use two contrasting episodes: December 2008 (financial crisis, ZLB transition) and March 2022 (tightening cycle liftoff).

## Statement Summarizer

The Statement Summarizer extracts the guidance signal from each FOMC Statement: the policy direction, commitment strength (0 = none, 1 = conditional, 2 = unconditional), guidance horizon, conditionality structure, and any balance sheet language. The output seeds the $`\mathcal{P}_1`$ update in the Expectation Engine for the following meeting. The March 2022 example below illustrates a strong tightening signal with unconditional path language and an 8-1 vote.

``` json
{
  "guidance_signal": {
    "meeting_date": "2022-03-16",
    "vote_tally": "8-1",
    "dissents": ["preferred to raise the target range by 0.5 percentage points"],
    "direction": "tightening",
    "commitment": 2,
    "horizon": "next_few",
    "conditionality": "explicit_data_dependent",
    "balance_sheet_guidance": "The Committee expects to begin reducing its holdings
      of Treasury securities and agency debt and agency MBS at a coming meeting."
  }
}
```

## Policy Extractor (Minutes)

The Policy Extractor processes the FOMC Minutes to recover the Committee’s internal deliberations: committee dynamics (dissent structure, coalition), mandate signals (inflation and labor market pressure scores from $`-1`$ to $`+1`$), forward guidance classification, and a calibrated policy stance score anchored to historical episodes. The output informs the $`\mathcal{P}_3`$ update.

### Example 1: Financial Crisis Period (December 2008)

The October 2008 Minutes informed expectations for the December 16, 2008 meeting. The Policy Extractor identifies unanimous agreement on a large cut, sharply subdued inflation ($`-0.70`$) and labor market conditions ($`-0.80`$), and an “outlook_conditional” forward guidance type reflecting the new ZLB language. The calibrated stance score of $`-0.90`$ is near the dovish anchor (March 2020 = $`-1.0`$).

``` json
{
  "decision_explanation": {
    "main_rationale": "The Committee established a target range of 0 to 1/4 percent
      due to significant economic downturn, deteriorating labor market conditions,
      diminished inflation pressures, and strained financial markets.",
    "key_drivers": ["inflation", "labor_market", "financial_conditions", "economic_activity"]
  },
  "committee_dynamics": {
    "num_dissents": 0,
    "dissent_balance": "none",
    "coalition_structure": "broad_consensus",
    "internal_debate_summary": "Unanimous agreement on near-zero target range;
      discussion centered on whether to announce an explicit rate target rather
      than on the direction of policy."
  },
  "forward_guidance": {
    "guidance_type": "outlook_conditional",
    "guidance_strength_score": 0.7,
    "explicit_guidance_text": "The Committee anticipates that weak economic conditions
      are likely to warrant exceptionally low federal funds rates for some time,
      conditional on economic outlook."
  },
  "mandate_signals": {
    "inflation_pressure": { "score": -0.7, "direction": "subdued", "confidence": "high" },
    "labor_market_tightness": { "score": -0.8, "direction": "loose", "confidence": "high" }
  },
  "narrative_diagnostics": {
    "policy_stance_score": {
      "score": -0.9,
      "nearest_historical_anchor_dovish": "September 2007 crisis response",
      "nearest_historical_anchor_hawkish": "January 2020 steady state"
    }
  }
}
```

### Example 2: Tightening Cycle Liftoff (March 2022)

The January 2022 Minutes informed expectations for the March 16, 2022 meeting. The Policy Extractor identifies one dissent (Bullard, preferring a 50bp hike), elevated inflation ($`+0.80`$) and tight labor markets, and outlook-conditional guidance. The calibrated stance score of $`+0.70`$ is anchored to the early-2022 inflation response.

``` json
{
  "decision_explanation": {
    "main_rationale": "The Committee raised rates due to elevated inflation pressures,
      a very tight labor market, and the need to begin removing policy accommodation.",
    "key_drivers": ["inflation", "labor_market", "financial_conditions", "economic_activity"]
  },
  "committee_dynamics": {
    "num_dissents": 1,
    "dissent_balance": "more_hawkish",
    "coalition_structure": "broad_consensus",
    "internal_debate_summary": "Most participants agreed on a 25bp hike and near-term
      balance sheet runoff; one member (Bullard) preferred 50bp to address inflation
      more aggressively."
  },
  "forward_guidance": {
    "guidance_type": "outlook_conditional",
    "guidance_strength_score": 0.6,
    "explicit_guidance_text": "The Committee anticipates that ongoing increases in
      the target range will be appropriate and will monitor incoming information
      to adjust policy as needed."
  },
  "mandate_signals": {
    "inflation_pressure": { "score": 0.8, "direction": "elevated", "confidence": "high" },
    "labor_market_tightness": { "score": 0.7, "direction": "tight", "confidence": "high" }
  },
  "narrative_diagnostics": {
    "policy_stance_score": {
      "score": 0.7,
      "nearest_historical_anchor_dovish": "January 2020 steady state (0.0)",
      "nearest_historical_anchor_hawkish": "early 2022 inflation response (0.7)"
    }
  }
}
```

## Beige Book Decoder

The Beige Book Decoder aggregates regional economic narratives into mandate-level signals. For each mandate dimension (inflation pressure, labor market tightness, and auxiliary topics), it derives a policy-direction probability ($`p_{\text{tighten}}, p_{\text{neutral}}, p_{\text{ease}}`$) from each district and aggregates using GDP weights. The resulting mandate signals feed the $`\mathcal{P}_4`$ update.

The March 2, 2022 Beige Book shows broadly elevated inflation with 82% probability of tightening and near-zero easing probability. The twelve districts contribute probabilities that are consistently hawkish on inflation, with only minor variation across regions.

``` json
{
  "meta": { "reference_date": "2022-03-16", "districts_analyzed": 12 },
  "mandate_signals": {
    "inflation_pressure": {
      "narrative": "Across all districts, inflation pressures are broadly elevated
        and persistent, driven by widespread input cost increases, strong wage growth,
        and ongoing supply chain disruptions.",
      "policy_probabilities": { "tighten": 0.82, "neutral": 0.16, "ease": 0.02 },
      "district_contributions": {
        "NYC": { "p_tighten": 0.85, "p_ease": 0.02, "gdp_weight": 0.15 },
        "SFR": { "p_tighten": 0.85, "p_ease": 0.00, "gdp_weight": 0.14 },
        "CHI": { "p_tighten": 0.85, "p_ease": 0.02, "gdp_weight": 0.10 },
        "DAL": { "p_tighten": 0.85, "p_ease": 0.00, "gdp_weight": 0.09 }
      }
    },
    "labor_market_tightness": {
      "policy_probabilities": { "tighten": 0.79, "neutral": 0.19, "ease": 0.02 }
    }
  }
}
```
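The GDP-weighted aggregation can be reproduced from the district contributions above. A minimal sketch, assuming weights are renormalised over the districts present and that the neutral probability is the residual mass (the paper states only that district probabilities are aggregated using GDP weights; the function name is mine):

```python
def aggregate_districts(districts: dict[str, dict[str, float]]) -> dict[str, float]:
    """GDP-weighted mandate-level policy probabilities from per-district
    signals of the form {"p_tighten": ..., "p_ease": ..., "gdp_weight": ...}.
    Weights are renormalised over the districts supplied; p_neutral is
    treated as the residual probability mass."""
    w = sum(d["gdp_weight"] for d in districts.values())
    p_t = sum(d["p_tighten"] * d["gdp_weight"] for d in districts.values()) / w
    p_e = sum(d["p_ease"] * d["gdp_weight"] for d in districts.values()) / w
    return {"tighten": p_t, "ease": p_e, "neutral": 1.0 - p_t - p_e}
```

Feeding in the four districts shown in the JSON reproduces the uniformly hawkish inflation signal: a GDP-weighted tightening probability of 0.85 and an easing probability near zero.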

## Press Conference Analyzer

The Press Conference Analyzer processes the Chair’s Q&A transcripts (available from April 2011) to extract forward guidance signals, commitment strength, and the composition of guidance between outlook-based and commitment-based language. It provides the $`\mathcal{P}_2`$ update. In March 2022, Chair Powell delivered unconditional path language committing to ongoing increases, with an equal split between outlook-based and commitment-based guidance.

``` json
{
  "signal": {
    "forward_guidance_signals": {
      "rate_path": {
        "direction": "hawkish",
        "key_quotes": [
          "the Committee anticipates that ongoing increases in the target range
            for the federal funds rate will be appropriate",
          "if we conclude that it would be appropriate to move more quickly to
            remove accommodation, then we'll do so"
        ]
      },
      "commitment_strength": {
        "assessment": "unconditional",
        "score": 2,
        "justification": "The phrase 'anticipates that ongoing increases will be
          appropriate' is binding path language committing to future action."
      },
      "guidance_composition": {
        "outlook_based": 0.5,
        "commitment_based": 0.5,
        "explanation": "The Chair balanced forward-looking economic expectations
          with strong commitments to ongoing rate increases and balance sheet reduction."
      }
    }
  }
}
```

## Sequential Bayesian Filtration

The Expectation Engine processes each document in sequence and updates the probability distribution over potential rate actions. Table <a href="#tab:filtration_sequence" data-reference-type="ref" data-reference="tab:filtration_sequence">[tab:filtration_sequence]</a> shows how the distribution evolves across stages for both example meetings. For December 2008, press conferences were not yet available, so the sequence runs $`\mathcal{P}_1 \to \mathcal{P}_3 \to \mathcal{P}_4 \to \mathcal{P}_5`$. For March 2022, all four documents are available and the full $`\mathcal{P}_1 \to \mathcal{P}_4`$ sequence runs before the optional $`\mathcal{P}_5`$ news update.
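The sequential filtration can be sketched as a reweight-and-renormalise loop over the distribution of candidate rate actions. The multiplicative-likelihood form and function names below are illustrative assumptions; in the pipeline, the posterior at each stage is elicited from the LLM rather than computed in closed form:

```python
def bayes_update(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """One filtration stage: reweight the action distribution by a
    document-implied likelihood and renormalise."""
    post = {a: prior[a] * likelihood.get(a, 1.0) for a in prior}
    z = sum(post.values())
    return {a: p / z for a, p in post.items()}

def run_filtration(prior: dict[str, float], stages: list[dict[str, float]]) -> dict[str, float]:
    """Apply the available stages (P1..P4, plus the optional P5) in sequence."""
    dist = dict(prior)
    for lik in stages:
        dist = bayes_update(dist, lik)
    return dist
```

For December 2008 the stage list skips $`\mathcal{P}_2`$ (no press conference yet); for March 2022 all four document stages run before the optional $`\mathcal{P}_5`$ news update.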

## Surprise Quantification

The Surprise Quantification agent computes $`\hat{s}_t = \Delta i_t - \hat{\mathbb{E}}[\Delta i_t \mid \mathcal{P}_5]`$ and records the surprise direction, a contextual surprise score (0–1), and a narrative justification. The December 2008 meeting produces a large dovish surprise ($`-46.25`$bp): the Fed cut to a 0–25bp range when the prior concentrated 65% probability on a 50bp cut and 35% on a 25bp cut, far short of the 87.5bp effective cut to the range midpoint. The March 2022 meeting produces a small hawkish surprise ($`+6.25`$bp): a 25bp hike was well-anticipated, with the prior assigning 75% probability to exactly that outcome.

``` json
{
  "meeting_date": "2008-12-16",
  "expected_rate_change": -0.4125,
  "realized_rate_change": -0.875,
  "surprise_rate": -0.4625,
  "surprise_score": 0.85,
  "surprise_direction": "dovish",
  "justification": "The realized rate change of -0.875% versus an expected -0.4125%
    yields a surprise of -0.4625%. The prior concentrated 65% probability on a 50bp
cut and 35% on a 25bp cut; the 87.5bp effective cut to the near-zero range midpoint
    was a strong dovish surprise, consistent with the unprecedented ZLB transition."
}
```

``` json
{
  "meeting_date": "2022-03-16",
  "expected_rate_change": 0.1875,
  "realized_rate_change": 0.25,
  "surprise_rate": 0.0625,
  "surprise_score": 0.5,
  "surprise_direction": "hawkish",
  "justification": "The Fed raised rates by 25bp versus an expected 18.75bp,
    resulting in a 6.25bp hawkish surprise. The prior assigned 75% probability
    to exactly a 25bp hike, so this is a modest surprise: the direction was
    correctly anticipated but the magnitude slightly exceeded expectations."
}
```
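The arithmetic in both justifications can be verified in a few lines. A minimal sketch, assuming the prior distributions stated in the text, with the March 2022 remainder placed on no change (consistent with the 18.75bp prior mean; function names are mine):

```python
def expected_change(prior: dict[float, float]) -> float:
    """Prior-mean rate change (percentage points) over candidate actions,
    given a dict mapping action size to probability."""
    return sum(action * p for action, p in prior.items())

def surprise(realized: float, prior: dict[float, float]) -> float:
    """Narrative surprise: realized change minus the prior mean."""
    return realized - expected_change(prior)
```

The December 2008 prior of 65% on a 50bp cut and 35% on a 25bp cut gives an expected change of −0.4125; against the realized −0.875 this yields the −0.4625 (−46.25bp) surprise reported above.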

# Pipeline Validation

This appendix validates the pipeline along three targets: internal integrity of the measurement instrument (Section <a href="#appendix:validation-robustness" data-reference-type="ref" data-reference="appendix:validation-robustness">9.1</a>), external construct validity against market and survey expectations (Section <a href="#sec:appendix_ce_validation" data-reference-type="ref" data-reference="sec:appendix_ce_validation">9.2</a>), and architectural validation of the one decoder whose design involves a non-trivial choice (Section <a href="#sec:beige-book-appendix" data-reference-type="ref" data-reference="sec:beige-book-appendix">9.3</a>).

## Internal Integrity Checks

The multi-agent architecture presents two measurement challenges: look-ahead bias, where the system may incorporate information unavailable at the time of analysis, and output variability from stochastic LLM inference. Both are addressed below.

### Look-Ahead Bias: Design and Empirical Test

<span id="appendix:multi-run-stability" label="appendix:multi-run-stability"></span>

Look-ahead bias occurs when LLMs trained on vast corpora (potentially including the very FOMC communications being analyzed) anachronistically apply ex-post knowledge to ex-ante analysis (Glasserman & Lin, 2024; Sarkar & Vafa, 2024). Sarkar & Vafa (2024) demonstrate that LLMs systematically generate temporally impossible sequences: GPT produced “COVID-19” in 6.8% of risk forecasts when queried about November 2019 earnings calls, despite this term not existing until months later. Simply instructing models to “ignore future information” proves insufficient, reducing but not eliminating contamination (COVID-19 mentions dropped from 12.2% to 6.8% with explicit temporal prompts). Glasserman & Lin (2024) document analogous indirect leakage: references to “pandemic,” “disease outbreak,” or “supply chain” were 3.6 times more common in LLM-generated 2020 risk assessments than 2019 ones. The pipeline addresses this threat through both architectural design and an empirical test.

#### Architectural controls.

Four complementary safeguards constrain what each agent can see and how it is asked to reason. *(i) Document-level predetermined cutoffs.* Each agent processes only publicly available information with strict temporal ordering: the Beige Book Decoder analyzes Beige Books released at least two weeks before each FOMC meeting; the Policy Extractor processes Minutes from the *previous* meeting, ensuring no overlap with the current decision; the Expectation Engine constructs probabilistic expectations using only information available before the blackout period begins. *(ii) Prompt-level temporal anchoring.* Agent instructions embed explicit date markers (“as of \[Beige Book release date\], before the FOMC meeting on \[meeting date\], analyze the following...”), following the recommendation of Sarkar & Vafa (2024). *(iii) Automated validation checks.* Output scanning flags forbidden temporal constructions (“the decision turned out to be,” “looking back,” “in retrospect”) and future date references; violations trigger automatic rejection and re-prompting. *(iv) Out-of-knowledge-cutoff validation.* The sample extends beyond DeepSeek-v3.1’s training cutoff (July 2024); meetings after this date were not in the model’s training data, enabling a direct comparison across the boundary.
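Safeguard *(iii)* can be sketched as a post-hoc scan of agent output. The forbidden-phrase list is the one quoted above; the ISO-date scan and the function name are illustrative assumptions about the implementation:

```python
import re
from datetime import date

# Retrospective constructions flagged by safeguard (iii); list quoted
# from the text, matching logic illustrative.
FORBIDDEN = [
    r"the decision turned out to be",
    r"looking back",
    r"in retrospect",
]

def violates_temporal_guard(output: str, meeting_date: date) -> bool:
    """True if agent output contains retrospective language, or an
    explicit ISO date strictly after the FOMC meeting it must precede.
    A True result triggers rejection and re-prompting."""
    low = output.lower()
    if any(re.search(p, low) for p in FORBIDDEN):
        return True
    for y, m, d in re.findall(r"(\d{4})-(\d{2})-(\d{2})", output):
        if date(int(y), int(m), int(d)) > meeting_date:
            return True
    return False
```

The ISO-date pattern matches the `meeting_date` format the agents emit in their JSON outputs.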

These controls reduce but cannot eliminate look-ahead bias. Sarkar & Vafa (2024) show that LLMs can infer censored temporal information even from indirect cues (correlation of 0.79 between predicted and actual years from date-censored earnings calls), and this study provides explicit temporal context rather than removing it. The empirical test below is therefore the load-bearing diagnostic.

#### Empirical test.

Under the memorization hypothesis, meetings beyond a model’s training cutoff should produce *higher* cross-run dispersion, since the model can no longer fall back on memorized outcomes and must reason stochastically from the documents alone. Figure <a href="#fig:lookahead-bias-test" data-reference-type="ref" data-reference="fig:lookahead-bias-test">[fig:lookahead-bias-test]</a> reports the test for five model families spanning three open-weight architectures (DeepSeek-v3.1 671B, Gemma4-31B, Qwen3.6-35B) and two proprietary models (GPT-4.1-mini, GPT-5-mini). Each family is split at its own training cutoff (July 2024 to January 2025) and the cross-run standard deviation is computed per meeting from independent pipeline executions at temperature 0. Across all five families the in-sample and out-of-sample distributions overlap heavily and the medians sit on top of each other; none of the five $`t`$-tests is significant. Table <a href="#tab:lookahead-bias-summary" data-reference-type="ref" data-reference="tab:lookahead-bias-summary">[tab:lookahead-bias-summary]</a> reports the DeepSeek-v3.1 summary as a baseline: 259 in-sample meetings have median cross-run standard deviations of 2.12 bp (expected rate change) and 2.05 bp (surprise rate); 13 out-of-sample meetings show 1.90 bp and 1.82 bp respectively, with neither difference rejecting at any conventional level. Out-of-sample dispersion is therefore not elevated relative to in-sample, rejecting the look-ahead hypothesis.
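The dispersion statistic underlying the test is straightforward to reproduce. A minimal sketch, assuming surprise draws are keyed by meeting date; the Welch $`t`$-test and per-family cutoffs are omitted, and the function names are mine:

```python
import statistics
from datetime import date

def cross_run_dispersion(runs: dict[date, list[float]]) -> dict[date, float]:
    """Per-meeting cross-run standard deviation (bp) over independent
    pipeline executions; meetings need at least two runs."""
    return {t: statistics.stdev(s) for t, s in runs.items() if len(s) >= 2}

def split_at_cutoff(disp: dict[date, float], cutoff: date) -> tuple[float, float]:
    """Median dispersion for meetings in-sample vs out-of-sample
    relative to a model's training cutoff."""
    ins = [v for t, v in disp.items() if t <= cutoff]
    outs = [v for t, v in disp.items() if t > cutoff]
    return statistics.median(ins), statistics.median(outs)
```

Under the memorization hypothesis the out-of-sample median would exceed the in-sample one; the reported medians (2.12 bp vs. 1.90 bp for the expected rate change) show no such elevation.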

<figure id="fig:lookahead-bias-test" data-latex-placement="t!">
<figure id="fig:lookahead-erc">
<embed src="figures/version_comparison/properties/v30.0/deepseek-v3.1:671b-cloud/lookahead_bias_multimodel_erc_light.pdf" />
</figure>
<figure id="fig:lookahead-sr">
<embed src="figures/version_comparison/properties/v30.0/deepseek-v3.1:671b-cloud/lookahead_bias_multimodel_sr_light.pdf" />
</figure>
<figcaption><em>Note:</em> Cross-run standard deviation per FOMC meeting, restricted to meetings with at least 4 independent runs per model family. Color encodes model family; marker shape encodes in-sample versus out-of-sample relative to each model’s training cutoff (vertical dashed lines, color-matched). Distributional summaries are reported in Table <a href="#tab:lookahead-bias-summary" data-reference-type="ref" data-reference="tab:lookahead-bias-summary">[tab:lookahead-bias-summary]</a> for the deepseek baseline. None of the five model families shows out-of-sample dispersion significantly elevated relative to in-sample.</figcaption>
</figure>

### Cross-Model Consistency

The multi-run stability exercise quantifies within-model stochasticity. A complementary question is whether the results are specific to the DeepSeek-v3.1 (671B) model used in the main analysis, or whether independently trained models converge to similar surprises when reading the same documents.

Figure <a href="#fig:cross_model_correlation" data-reference-type="ref" data-reference="fig:cross_model_correlation">17</a> reports the full $`18 \times 18`$ Pearson correlation matrix of surprise series across six model families (DeepSeek-v3.1 671B, Gemma4-31B, Qwen3.6-35B, Kimi-K2.6, GPT-4.1-mini, GPT-5-mini) and three pipeline versions (v30.0–v30.2), yielding eighteen independent runs in total. The matrix has a natural block structure: diagonal $`3 \times 3`$ blocks capture within-family stability; off-diagonal blocks capture cross-family agreement.

<figure id="fig:cross_model_correlation" data-latex-placement="t!">
<embed src="figures/validation/cross_model/cross_model_correlation_light.pdf" style="width:88.0%" />
<figcaption><em>Note:</em> Pearson correlations of monetary policy surprise series (<span class="math inline"><em>ŝ</em><sub><em>t</em></sub></span>, in basis points) across 18 independent pipeline runs: DeepSeek-v3.1, Gemma4-31B, Qwen3.6-35B, Kimi-K2.6, GPT-4.1-mini, and GPT-5-mini, each run at three pipeline versions (v30.0–v30.2). Rounded borders delineate the six <span class="math inline">3 × 3</span> within-family blocks; color scale runs from 0.66 (cream, sample minimum) to 1.0 (dark sienna). Within-family averages: DeepSeek 0.927, Gemma4 0.934, Qwen 0.915, Kimi 0.927, GPT-4.1-mini 0.961, GPT-5-mini 0.943. Cross-family averages range from 0.72 (GPT-5-mini paired with DeepSeek or Kimi) to 0.90 (DeepSeek<span class="math inline">×</span>GPT-4.1-mini); all 15 cross-family pairs lie in <span class="math inline">[0.72, 0.90]</span>. Pairwise correlations use all available common meetings; Kimi runs are restricted to 235–249 meetings (subset of the full 272-meeting sample) due to API quota limits at processing time.</figcaption>
</figure>

Within-family correlations are uniformly high, ranging from 0.915 (Qwen) to 0.961 (GPT-4.1-mini), confirming the same high stability documented in Section <a href="#appendix:multi-run-stability" data-reference-type="ref" data-reference="appendix:multi-run-stability">[appendix:multi-run-stability]</a>. Cross-family correlations are systematically lower but remain substantial: the 15 cross-family pair-averages span $`[0.72, 0.90]`$, with the strongest agreement among the open-weight families (DeepSeek, Gemma4, Qwen, Kimi: 0.83–0.89) and slightly weaker agreement when GPT-5-mini is involved (0.72–0.79). The within-to-cross gap is 0.04–0.20, comparable to or modestly larger than the gap between adjacent and non-adjacent pipeline versions within a single family.
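The within- versus cross-family block averages in the caption can be computed from the run-level series with a few lines. A minimal sketch, assuming runs are keyed by (family, version) tuples; function names are mine:

```python
from itertools import combinations

def pearson(x: list[float], y: list[float]) -> float:
    """Sample Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def block_averages(runs: dict[tuple[str, str], list[float]]) -> tuple[float, float]:
    """Average within-family vs cross-family correlation over all run
    pairs, classifying each pair by whether the family labels match."""
    within, cross = [], []
    for k1, k2 in combinations(runs, 2):
        r = pearson(runs[k1], runs[k2])
        (within if k1[0] == k2[0] else cross).append(r)
    return sum(within) / len(within), sum(cross) / len(cross)
```

The gap between the two averages is the within-to-cross gap of 0.04–0.20 discussed above.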

This pattern has a direct bearing on the memorization concern. The six families are trained by five different organizations (DeepSeek, Google, Alibaba, Moonshot, OpenAI) on different corpora and have different knowledge cutoffs. If the pipeline’s accuracy stemmed from models recalling memorized FOMC outcomes, cross-family correlations would either be near-perfect (all models recalling identical outcomes) or near-zero (models memorizing conflicting information). Instead, cross-family agreement lies below within-family stability but far above zero, consistent with all six models performing the same document-reading task and disagreeing primarily on ambiguous passages rather than on the direction or magnitude of well-identified surprises.

### Idiosyncratic variance: information or noise?

A high but imperfect cross-family correlation can reflect two very different things. The slice of variance one family does not share with the others may be genuine information about FOMC documents that the family recovers and the rest of the panel misses, or it may be model-specific miscalibration noise. The first case is also where look-ahead leakage would manifest: a family that has partially memorized realized rate decisions during training would show the same statistical fingerprint as a family that simply reads the documents better. The exercises below quantify how big this idiosyncratic slice is, whether it predicts what markets do, and whether its structure is compatible with memorization.

The starting point is a leave-one-out projection of each family’s surprise on the consensus of the rest,
``` math
\begin{equation}
\hat s_{i,t} \;=\; \alpha_i \;+\; \beta_i\,\bar{s}_{-i,t} \;+\; \hat\varepsilon_{i,t},
\qquad
\bar{s}_{-i,t} \;=\; \frac{1}{N-1}\sum_{j \neq i} \hat s_{j,t},
\label{eq:cmv-decomp}
\end{equation}
```
where $`\bar{s}_{-i,t}`$ is the leave-one-out mean across the other four families in the panel. I exclude Kimi-K2.6 so that the common sample retains the full $`T = 272`$ FOMC meetings between 1996 and 2026, including the post-2022 tightening and easing cycle that drives most of the variance in the surprise series. The constant $`\alpha_i`$ absorbs any level differences across families, including the modest hawkish bias of the OpenAI variants ($`\bar{\hat s}_i \approx -4`$ basis points for GPT-4.1-mini and GPT-5-mini, against $`\approx -0.8`$ for the other four families).[^24] I then test whether the residual $`\hat\varepsilon_{i,t}`$ has incremental predictive content for FOMC-day yield changes,
``` math
\begin{equation}
\Delta y_{m,t} \;=\; \delta_{i,m} \;+\; \theta_{i,m}\,\bar{s}_{-i,t} \;+\; \gamma_{i,m}\,\hat\varepsilon_{i,t} \;+\; u_{i,m,t},
\label{eq:cmv-info}
\end{equation}
```
where $`\Delta y_{m,t}`$ is the one-day change, on the FOMC announcement date, in the daily Treasury constant-maturity yield at maturity $`m`$ from the Federal Reserve Economic Database. This is a market-validation test, not a truth test: a coefficient $`\hat\gamma_{i,m}`$ statistically distinct from zero with a positive incremental $`R^2`$ says the residual carries content markets price; a coefficient indistinguishable from zero says it is uncorrelated with FOMC-day market reactions, which I treat as a working definition of noise.
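The two equations above can be sketched without a regression library by using the fact that the OLS residual is orthogonal to its regressor: after demeaning, the incremental $`R^2`$ from adding $`\hat\varepsilon_{i,t}`$ to the consensus-only specification equals its squared correlation with $`\Delta y_{m,t}`$. A minimal sketch (function names are mine; the HC3 inference used in the table is omitted):

```python
def _demean(v: list[float]) -> list[float]:
    m = sum(v) / len(v)
    return [x - m for x in v]

def ols_residual(y: list[float], x: list[float]) -> list[float]:
    """Residual from OLS of y on a constant and a single regressor x
    (the leave-one-out decomposition with the consensus as x)."""
    yd, xd = _demean(y), _demean(x)
    beta = sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)
    return [a - beta * b for a, b in zip(yd, xd)]

def incremental_r2(dy: list[float], eps: list[float]) -> float:
    """Incremental R^2 from adding eps to a regression of dy on the
    consensus: because eps is orthogonal to the consensus (and to the
    constant) by construction, this equals corr(dy, eps) squared."""
    e, y = _demean(eps), _demean(dy)
    num = sum(a * b for a, b in zip(e, y)) ** 2
    den = sum(a * a for a in e) * sum(b * b for b in y)
    return num / den
```

The residual share in the table is then `var(eps) / var(s_i)`, and the last row applies `incremental_r2` to the 3-month yield change.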

<div id="tab:cross_model_variance">

|  | DeepSeek-v3.1 | Gemma4-31B | Qwen3.6-35B | GPT-4.1-mini | GPT-5-mini |
|:---|:---|:---|:---|:---|:---|
| $`\hat\sigma(\hat s_i)`$ (bp) |  |  |  |  |  |
| $`\hat\sigma(\hat\varepsilon_i)`$ (bp) |  |  |  |  |  |
| Residual share |  |  |  |  |  |
| non-OpenAI consensus |  |  |  |  |  |
| $`\Delta R^2`$ on $`\Delta y_{3m}`$ |  |  |  | $`^{*}`$ |  |

*Notes:* Estimates of equations <a href="#eq:cmv-decomp" data-reference-type="eqref" data-reference="eq:cmv-decomp">[eq:cmv-decomp]</a> and <a href="#eq:cmv-info" data-reference-type="eqref" data-reference="eq:cmv-info">[eq:cmv-info]</a> on $`T = 272`$ FOMC meetings (1996–2026), pipeline version v30.1. Rows report the standard deviation of family $`i`$’s surprise series, the standard deviation of its residual from <a href="#eq:cmv-decomp" data-reference-type="eqref" data-reference="eq:cmv-decomp">[eq:cmv-decomp]</a>, the residual variance share $`\hat\sigma^2(\hat\varepsilon_i)/\hat\sigma^2(\hat s_i)`$ under the baseline pool, the same share when the consensus is built only from the four non-OpenAI families, and the incremental $`R^2`$ from adding $`\hat\varepsilon_{i,t}`$ to the consensus-only specification of <a href="#eq:cmv-info" data-reference-type="eqref" data-reference="eq:cmv-info">[eq:cmv-info]</a> at the 3-month maturity. HC3 heteroskedasticity-robust standard errors. Stars on $`\Delta R^2`$ denote rejection of $`H_0\!: \gamma_{i,3m}=0`$ at $`^{*}p<0.10`$, $`^{**}p<0.05`$, $`^{***}p<0.01`$.

</div>

Table <a href="#tab:cross_model_variance" data-reference-type="ref" data-reference="tab:cross_model_variance">2</a> delivers a clean first reading. Five of the six families have residual shares between $`0.16`$ and $`0.20`$: roughly four-fifths of each family’s surprise variance is already explained by the others’ consensus, and the result is essentially unchanged when OpenAI models are removed from the consensus pool. GPT-5-mini stands apart, with a residual share of $`0.36`$ and a total surprise standard deviation about thirty percent wider than the others. The yield-information test in the last row takes the question of whether that dispersion is signal or noise to the data: across all six families, the incremental $`R^2`$ on the 3-month yield change is small ($`\Delta R^2 \leq 0.04`$), and only GPT-4.1-mini reaches even marginal significance under HC3 standard errors. GPT-5-mini’s wider residual produces no detectable incremental content ($`\hat\gamma`$ insignificant, $`\Delta R^2 \approx 0.015`$), which on this metric supports the noise interpretation; the other five families show at most limited evidence of incremental signal at the short end of the curve.

The yield-information test can rule the residuals in as informative for what markets price, but it does not separate genuine reading skill from partial leakage of training-set FOMC outcomes. To probe the leakage channel, I run a direct diagnostic: regress each family’s residual on the realized rate change at $`t`$.[^25] The results are in Table <a href="#tab:cross_model_residual_realized" data-reference-type="ref" data-reference="tab:cross_model_residual_realized">3</a>. Five of the six families show small but precisely estimated loadings ($`R^2`$ between $`0.05`$ and $`0.15`$); only GPT-5-mini’s residual is uncorrelated with the realized outcome. The signs of the significant loadings are decisive. DeepSeek-v3.1, Qwen3.6-35B, and GPT-4.1-mini all have positive loadings, which (by the identity in the previous footnote) implies their priors are *less* accurate than the consensus and rules out memorization for those three families. Gemma4-31B has a negative loading, indicating a prior that is more accurate than the consensus average; this is the one family whose loading is consistent with either better reading or memorization, but Gemma4’s mid-2024 training cutoff and broad multilingual training mix give no special prior reason to expect FOMC-specific memorization. GPT-5-mini’s loading is statistically zero. The two model variants with the latest cutoffs and the heaviest English-press exposure, where leakage would be most plausible *a priori*, therefore land on opposite sides of the test: GPT-5-mini is uncorrelated with the realized rate, and GPT-4.1-mini’s positive loading actively rules out memorization. The pattern lines up with heterogeneous prior accuracy across families, not with a memorization gradient.

<div id="tab:cross_model_residual_realized">

|  | DeepSeek-v3.1 | Gemma4-31B | Qwen3.6-35B | GPT-4.1-mini | GPT-5-mini |
|:---|:---|:---|:---|:---|:---|
| $`\hat\beta`$ on realized | $`^{***}`$ | $`^{***}`$ | $`^{***}`$ | $`^{***}`$ |  |
|  | (0.015) | (0.019) | (0.025) | (0.026) | (0.037) |
| $`R^2`$ |  |  |  |  |  |
| $`T`$ |  |  |  |  |  |

*Notes:* Univariate regressions of each family’s residual $`\hat\varepsilon_{i,t}`$ from <a href="#eq:cmv-decomp" data-reference-type="eqref" data-reference="eq:cmv-decomp">[eq:cmv-decomp]</a> on the realized rate change at $`t`$, with HC3 standard errors. A residual that reflects pure miscalibration noise should be uncorrelated with the realized outcome; a residual that reflects memorization should load on it with the same sign across families and with magnitudes that increase in training-data exposure. Stars: $`^{*}p<0.10`$, $`^{**}p<0.05`$, $`^{***}p<0.01`$.

</div>

A complementary test exploits the training-cutoff structure directly. For each family I pool pre- and post-cutoff observations and estimate
``` math
\begin{equation}
\Delta y_{3m,t} \;=\; \delta + \theta\,\bar{s}_{-i,t} + \gamma\,\hat\varepsilon_{i,t} + \rho\,\mathbf{1}\{t \geq c_i\} + \zeta\,\hat\varepsilon_{i,t}\cdot\mathbf{1}\{t \geq c_i\} + u_t,
\label{eq:cmv-pooled}
\end{equation}
```
where $`c_i`$ is family $`i`$’s published training cutoff and $`\hat\zeta`$ is the change in residual informativeness after the cutoff. Pure leakage would produce $`\hat\zeta < 0`$, statistically distinct from zero: information that markets price should be present in the residual when the model could have memorized it and absent when it could not. To benchmark the test against ordinary sample-period heterogeneity, I re-estimate <a href="#eq:cmv-pooled" data-reference-type="eqref" data-reference="eq:cmv-pooled">[eq:cmv-pooled]</a> at four placebo cutoffs (2008, 2012, 2016, 2020), all of which pre-date every family’s training cutoff and so are null by construction. Table <a href="#tab:cross_model_pooled_interaction" data-reference-type="ref" data-reference="tab:cross_model_pooled_interaction">4</a> reports the result. None of the five actual interactions is statistically distinct from zero, and in every case the actual $`|\hat\zeta|`$ is smaller than the largest placebo interaction obtained on the same family. The test thus sees no leakage signature, while the placebo distribution warns that interactions of the magnitude one would need to detect can be generated by ordinary heterogeneity over time without any role for training data.
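The mechanics of this comparison, estimating the interaction coefficient at the actual cutoff and benchmarking it against the same coefficient at placebo cutoffs, can be sketched as follows. Data, dates, and variable names are all illustrative stand-ins for the paper's series:

```python
import numpy as np

def interaction_coef(dy, cons, resid, dates, cutoff):
    """OLS of the pooled specification; returns the interaction zeta-hat."""
    post = (dates >= cutoff).astype(float)
    X = np.column_stack([np.ones_like(dy), cons, resid, post, resid * post])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    return beta[4]                          # coefficient on resid x post

rng = np.random.default_rng(1)
T = 272
dates = np.linspace(1996.0, 2026.0, T)      # meeting dates as decimal years
cons = rng.normal(0, 10, T)                 # leave-one-out consensus surprise (bp)
resid = rng.normal(0, 5, T)                 # family residual (bp)
dy = 0.6 * cons + rng.normal(0, 6, T)       # 3-month yield change (bp)

zeta_actual = interaction_coef(dy, cons, resid, dates, cutoff=2024.5)
zeta_placebo = [interaction_coef(dy, cons, resid, dates, c)
                for c in (2008.0, 2012.0, 2016.0, 2020.0)]
# A leakage signature would require the actual interaction to exceed
# what ordinary sample-period heterogeneity generates at placebo dates.
leakage_flag = abs(zeta_actual) > max(abs(z) for z in zeta_placebo)
```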

<div id="tab:cross_model_pooled_interaction">

|  | DeepSeek-v3.1 | Gemma4-31B | Qwen3.6-35B | GPT-4.1-mini | GPT-5-mini |
|:---|:---|:---|:---|:---|:---|
| $`\hat\gamma_{\text{pre}}`$ |  |  |  |  |  |
|  | (0.075) | (0.077) | (0.085) | (0.094) | (0.048) |
| $`\hat\zeta`$ (interaction) |  |  |  |  |  |
|  | (0.195) | (0.254) | (0.244) | (0.190) | (0.149) |
| Placebo $`\max|\hat\zeta|`$ |  |  |  |  |  |
| Cutoff date | -07-01 | -08-01 | -09-01 | -06-01 | -09-30 |
| $`T_{\text{pre}}`$ |  |  |  |  |  |
| $`T_{\text{post}}`$ |  |  |  |  |  |

*Notes:* Estimates of equation <a href="#eq:cmv-pooled" data-reference-type="eqref" data-reference="eq:cmv-pooled">[eq:cmv-pooled]</a> for each family. $`\hat\gamma_{\text{pre}}`$ is the residual coefficient before the family’s training cutoff and $`\hat\zeta`$ is the change in that coefficient after the cutoff. Pure leakage would produce a negative $`\hat\zeta`$ statistically distinct from zero. The placebo row reports the largest absolute value of $`\hat\zeta`$ obtained when the cutoff is replaced with one of four pre-2024 placebo dates (2008, 2012, 2016, 2020); for every family the actual $`|\hat\zeta|`$ is smaller than the placebo maximum. HC3 standard errors; stars at $`^{*}p<0.10`$, $`^{**}p<0.05`$, $`^{***}p<0.01`$.

</div>

A final sanity check bounds how much memorization the residual-on-realized loadings could in principle accommodate, even setting aside the sign argument that already rules it out for three families. For each family, the share of total surprise variance that could be attributed to a linear leakage channel is at most the residual’s variance share times its $`R^2`$ on the realized rate. Per-family upper bounds are 0.85 percent (DeepSeek-v3.1), 2.48 percent (Gemma4-31B), 1.76 percent (Qwen3.6-35B), 2.38 percent (GPT-4.1-mini), and 0.22 percent (GPT-5-mini); the average is 1.5 percent and no family reaches three percent. The tightest bound, notably, sits on the family with the widest residual variance: GPT-5-mini’s wide dispersion does not buy it any room to be hiding leaked information, because its residual is uncorrelated with realized rate changes. Taken together, the four exercises bound the idiosyncratic slice of variance, find that slice not informative for short-end yield reactions, find that the sign and recency pattern across families is incompatible with a memorization gradient, find no leakage signature at the actual training cutoffs once those are benchmarked against placebos, and place a hard upper bound below three percent on the share of total variance that any family’s residual could attribute to leakage even under maximally adversarial assumptions. The cross-family disagreement is mostly noise around a shared signal, with no detectable memorization signature in the data.
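The bound itself is a one-line product: the residual's variance share times its $`R^2`$ on the realized rate. A sketch with illustrative inputs (the shares and $`R^2`$ values below are stand-ins in the ranges reported above, not the estimated ones):

```python
# Illustrative inputs, NOT the paper's estimates: residual variance share
# of each family's surprise, and the residual's R^2 on the realized change.
resid_share = {"DeepSeek-v3.1": 0.17, "Gemma4-31B": 0.19, "GPT-5-mini": 0.36}
r2_realized = {"DeepSeek-v3.1": 0.05, "Gemma4-31B": 0.13, "GPT-5-mini": 0.006}

# Upper bound on the share of TOTAL surprise variance a linear leakage
# channel could explain: (residual variance share) x (R^2 on realized).
bound = {k: resid_share[k] * r2_realized[k] for k in resid_share}
```

Note how the widest residual (GPT-5-mini's 0.36 share) still yields the tightest bound, because its near-zero $`R^2`$ on the realized rate enters multiplicatively.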

Main econometric results use the full sample. Narrative-based results are compared against market-based benchmarks in Section <a href="#sec:results" data-reference-type="ref" data-reference="sec:results">5</a>.

## External Construct Validation

### Real-Time Information Set and Transparency Regimes

The sequential filtration pipeline uses exactly one document of each type per meeting: the most recently *released* statement, press conference, minutes, and Beige Book strictly before the meeting date.
For statements and press conferences this is unambiguous, since both are published on the day of the previous meeting.
For minutes, it is not: prior to February 2005 the FOMC released minutes with a lag of approximately 50 days, meaning that the minutes of meeting $`M-1`$ typically arrived one to three days *after* meeting $`M`$.
Including those minutes in $`\mathcal{P}_3`$ for meeting $`M`$ would thus incorporate information that was unavailable to market participants.

I correct for this by maintaining a real-time release calendar (`fomc_calendar.csv`) that records the actual publication date of every document.
At stage $`\mathcal{P}_3`$, the pipeline queries this calendar for the most recently released minutes strictly before each meeting date.
For all meetings from January 1996 through November 2004 (76 meetings in total), this returns the minutes of meeting $`M-2`$ rather than $`M-1`$; from February 2005 onward, when the Fed shortened the release lag to approximately 21 days, the pipeline correctly uses $`M-1`$ minutes throughout.
The correction has no effect on any meeting after November 2004.
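The calendar query at stage $`\mathcal{P}_3`$ reduces to selecting the most recently *released* minutes strictly before each meeting date. A minimal sketch (the example dates are illustrative; the pipeline reads the actual release calendar from `fomc_calendar.csv`):

```python
from datetime import date

# Illustrative release calendar: (meeting the minutes cover, release date).
releases = [
    (date(2004, 9, 21), date(2004, 11, 9)),   # ~50-day lag (pre-2005 regime)
    (date(2004, 11, 10), date(2004, 12, 28)),
    (date(2004, 12, 14), date(2005, 1, 4)),
    (date(2005, 2, 2), date(2005, 2, 23)),    # ~21-day lag (post-2005 regime)
]

def latest_minutes_before(meeting_date):
    """Most recently *released* minutes strictly before the meeting date."""
    avail = [(m, r) for m, r in releases if r < meeting_date]
    return max(avail, key=lambda x: x[1])[0] if avail else None

# For the 2004-12-14 meeting, the M-1 minutes (covering 2004-11-10) were
# not yet published, so the query falls back to the M-2 minutes.
```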

#### Direction of the bias.

The uncorrected pipeline gave the LLM *more* information pre-2005 than was actually available, by passing it minutes that had not yet been published.
Correcting this makes the prior $`\mathcal{P}_3`$ less informed in the early sample, widening the distribution and increasing surprise magnitudes.
This is conservative relative to the main results: the corrected early-era surprises are noisier, not cleaner.
The finding that identification quality improves with Fed transparency therefore survives the correction rather than being an artefact of it.

### Validation Against Market Expectations (Fed Funds Futures)

I first validate the LLM’s pre-meeting expectation against the contemporaneous market expectation embedded in Fed Funds Futures.
Using the high-frequency surprise series of Jarociński & Karadi (2020), I back out the market-implied expected rate change as the realized decision minus the front-month futures (FF1) surprise:
``` math
\begin{equation}
    \mathbb{E}^{\text{market}}[\Delta i_t] \;=\; \Delta i_t - s_t^{\text{FF1}}.
\end{equation}
```
This benchmark is independent of the LLM and of survey measurement: it is a price-based expectation, sampled in the announcement window, for the same policy decision that the LLM forecasts from public Fed documents.
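Numerically, the back-out is a single subtraction; a worked example with illustrative magnitudes, in basis points:

```python
# Back out the market-implied expected rate change from the realized
# decision and the front-month futures (FF1) surprise. Numbers are
# illustrative, in basis points.
realized_change = -50.0        # Delta i_t: the FOMC cut by 50 bp
ff1_surprise = -12.0           # s_t^FF1: announcement-window futures surprise
market_expected = realized_change - ff1_surprise   # expectation priced in
```

Here markets had priced in a 38 bp cut, and the remaining 12 bp of easing arrived as a surprise.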

<figure id="fig:fff-llm-timeseries" data-latex-placement="t!">
<embed src="figures/validation/consensus_economics/v30.5/deepseek-v3.1:671b-cloud/fff_llm_timeseries_light.pdf" />
<figcaption><em>Note:</em> Difference between the LLM’s documents-only expectation <span class="math inline">𝔼[<em>Δ</em><em>i</em><sub><em>t</em></sub> ∣ 𝒫<sub>4</sub>]</span> and the market expectation <span class="math inline">𝔼<sup>market</sup>[<em>Δ</em><em>i</em><sub><em>t</em></sub>] = <em>Δ</em><em>i</em><sub><em>t</em></sub> − <em>s</em><sub><em>t</em></sub><sup>FF1</sup></span> backed out from <span class="citation" data-cites="JarocinskiKaradi2020">Jarociński &amp; Karadi (2020)</span> front-month Fed Funds Futures surprises, in basis points. Background bands mark Fed transparency regimes: <em>grey</em> (1990s, opaque), <em>green</em> (2000–04), <em>amber</em> (2005–10, fast minutes), <em>blue</em> (2011–18, selected press conferences), and <em>purple</em> (2019+, every-meeting press conferences). Full sample <span class="math inline"><em>N</em> = 214</span>.</figcaption>
</figure>

<figure id="fig:fff-llm-scatter" data-latex-placement="t!">
<embed src="figures/validation/consensus_economics/v30.5/deepseek-v3.1:671b-cloud/fff_llm_scatter_light.pdf" />
<figcaption><em>Note:</em> Each point is one FOMC meeting. Horizontal axis: market expectation in basis points. Vertical axis: LLM expectation in basis points. Colours denote transparency regime. The dashed line is the 45-degree reference; the solid line is an OLS fit. Pearson correlations by regime: full sample <span class="math inline"><em>r</em> = 0.80</span> (<span class="math inline"><em>N</em> = 214</span>); 1990s <span class="math inline"><em>r</em> = 0.39</span> (<span class="math inline"><em>N</em> = 32</span>); 2000–04 <span class="math inline"><em>r</em> = 0.80</span> (<span class="math inline"><em>N</em> = 40</span>); 2005–10 <span class="math inline"><em>r</em> = 0.81</span> (<span class="math inline"><em>N</em> = 44</span>); 2011–18 <span class="math inline"><em>r</em> = 0.45</span> (<span class="math inline"><em>N</em> = 60</span>); 2019+ <span class="math inline"><em>r</em> = 0.92</span> (<span class="math inline"><em>N</em> = 38</span>).</figcaption>
</figure>

Figures <a href="#fig:fff-llm-timeseries" data-reference-type="ref" data-reference="fig:fff-llm-timeseries">18</a> and <a href="#fig:fff-llm-scatter" data-reference-type="ref" data-reference="fig:fff-llm-scatter">19</a> show close agreement between the document-based LLM expectation and the market-implied expectation.
The full-sample correlation is $`r = 0.80`$ across 214 meetings, and the strongest validation comes from the post-2019 regime, where every-meeting press conferences make the public information set richest and the correlation rises to $`r = 0.92`$.
The 2011–18 correlation is lower ($`r = 0.45`$), but this period is mechanically compressed by the zero lower bound: when the funds rate is pinned near zero, both series of expected rate changes have little variance, so small idiosyncratic differences dominate the correlation.

This price-based comparison establishes that the LLM is not producing arbitrary numerical forecasts.
It does not, however, test the main object of interest: whether the residual surprise is larger when the policy decision is genuinely harder to infer from public information.
For that, I turn from market prices to the cross-section of professional forecasts.

### Forecaster Disagreement and Policy Uncertainty

<span id="appendix:ce-dispersion" label="appendix:ce-dispersion"></span>

*Consensus Economics* (CE) provides a complementary validation source.
The survey has polled approximately 30 professional forecasters on US macroeconomic outcomes since 1990, including the expected 3-month interest rate.
I use the cross-forecaster standard deviation of that forecast, aligned to the next FOMC meeting, as an external measure of how difficult the near-term policy path was to predict from public information.

<figure id="fig:ce-dispersion" data-latex-placement="t!">
<embed src="figures/validation/consensus_economics/v30.5/deepseek-v3.1:671b-cloud/ce_dispersion_light.pdf" />
<figcaption><em>Note:</em> Cross-forecaster standard deviation of the 3-month interest rate forecast from the Consensus Economics USA monthly survey, aligned to the next FOMC meeting. Background bands mark the same Fed transparency regimes used in Figure <a href="#fig:fff-llm-timeseries" data-reference-type="ref" data-reference="fig:fff-llm-timeseries">18</a>: <em>grey</em> (1990s, opaque), <em>green</em> (2000–04), <em>amber</em> (2005–10, fast minutes), <em>blue</em> (2011–18, selected press conferences), and <em>purple</em> (2019+, every-meeting press conferences).</figcaption>
</figure>

<figure id="fig:ce-dispersion-multi" data-latex-placement="t!">
<embed src="figures/validation/consensus_economics/v30.5/deepseek-v3.1:671b-cloud/ce_dispersion_multi_light.pdf" />
<figcaption><em>Note:</em> Cross-forecaster standard deviation, in percentage points, for six variables surveyed by Consensus Economics: the 3-month interest rate, the 10-year Treasury yield, CPI inflation, GDP growth, the unemployment rate, and industrial production. Each panel shows the monthly raw standard deviation indexed by survey date. Background bands mark the same Fed transparency regimes used in Figure <a href="#fig:fff-llm-timeseries" data-reference-type="ref" data-reference="fig:fff-llm-timeseries">18</a> (grey, green, amber, blue, purple from the 1990s to the post-2019 era).</figcaption>
</figure>

Figures <a href="#fig:ce-dispersion" data-reference-type="ref" data-reference="fig:ce-dispersion">20</a> and <a href="#fig:ce-dispersion-multi" data-reference-type="ref" data-reference="fig:ce-dispersion-multi">21</a> show that raw forecast disagreement is strongly cyclical.
Rate disagreement spikes around the 2001 easing cycle and the 2007–2008 crisis, then compresses during the zero-lower-bound years.
The same pattern appears across CPI inflation, unemployment, GDP growth, industrial production, and long rates, indicating that the CE series captures broad macroeconomic uncertainty rather than a narrow measurement artefact in the 3-month rate forecast.

The first link to the LLM measure is descriptive.
Figure <a href="#fig:ce-dispersion-scatter" data-reference-type="ref" data-reference="fig:ce-dispersion-scatter">22</a> plots CE disagreement against the absolute LLM surprise $`|\hat{s}_t|`$.
Meetings with greater professional disagreement also tend to produce larger narrative surprises, with a pooled correlation of $`r = 0.43`$ over 221 meetings, significant at the 1% level.
This relationship is not causal in either direction: forecasters do not observe the LLM surprise, and the LLM does not observe the survey cross-section.
Both variables respond to the same underlying policy uncertainty.

<figure id="fig:ce-dispersion-scatter" data-latex-placement="t!">
<embed src="figures/validation/consensus_economics/v30.5/deepseek-v3.1:671b-cloud/ce_dispersion_scatter_light.pdf" />
<figcaption><em>Note:</em> Each point is one FOMC meeting. Horizontal axis: CE cross-forecaster standard deviation of the 3-month interest rate forecast (pp). Vertical axis: absolute LLM surprise <span class="math inline">|<em>ŝ</em><sub><em>t</em></sub>|</span> (bp). One panel per Fed transparency regime; per-panel Pearson <span class="math inline"><em>r</em></span> shown in titles. Pooled relationship is positive and significant (<span class="math inline"><em>r</em> = 0.43</span>, <span class="math inline"><em>p</em> &lt; 0.001</span>, <span class="math inline"><em>N</em> = 221</span>).</figcaption>
</figure>

The sharper test is predictive.
I estimate whether a large LLM surprise at meeting $`t-1`$ predicts higher professional disagreement before meeting $`t`$, while controlling for transparency regimes, financial volatility, and recessions:
``` math
\begin{equation}
\label{eq:ce-regime-regression}
    \sigma_t = \alpha + \sum_{k} \beta_k \, \mathds{1}_{t \in \mathcal{R}_k} + \gamma \, \text{VIX}_t + \delta \, \text{Rec}_t + \zeta \, |\hat{s}_{t-1}| + \varepsilon_t,
\end{equation}
```
where $`\sigma_t`$ is the CE cross-forecaster standard deviation of the 3-month rate forecast at meeting $`t`$, and $`\{\mathcal{R}_k\}`$ are transparency regime intervals (2000–04, 2005–10, 2011–18, 2019+) defined by FOMC communication breakpoints, with 1994–99 as the omitted baseline.
Controls are added progressively across columns in Table <a href="#tab:ce-regime-regression" data-reference-type="ref" data-reference="tab:ce-regime-regression">[tab:ce-regime-regression]</a>.
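The incremental-fit comparison behind these columns can be sketched with synthetic data: fit the specification with regime dummies, VIX, and recessions, then add the lagged absolute surprise and record the $`R^2`$ gain. All series and names below are illustrative stand-ins, generated so the lagged surprise genuinely matters:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 221
year = np.linspace(1994.0, 2026.0, T)
# Transparency-regime dummies; 1994-99 is the omitted baseline.
regimes = np.column_stack([(year >= a) & (year < b)
                           for a, b in [(2000, 2005), (2005, 2011),
                                        (2011, 2019), (2019, 2027)]]).astype(float)
vix = rng.normal(20, 6, T)                  # CBOE VIX control
rec = (rng.random(T) < 0.1).astype(float)   # NBER recession indicator
abs_s_lag = np.abs(rng.normal(0, 8, T))     # |s_{t-1}|, lagged LLM surprise (bp)
sigma = 0.05 + 0.02 * abs_s_lag + 0.003 * vix + rng.normal(0, 0.05, T)

def r2(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - ((y - X @ b) ** 2).sum() / ((y - y.mean()) ** 2).sum()

base = r2(np.column_stack([regimes, vix, rec]), sigma)       # column (3)-style
full = r2(np.column_stack([regimes, vix, rec, abs_s_lag]), sigma)  # column (4)-style
gain = full - base          # incremental fit from the lagged |surprise|
```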

Column (4) contains the main cross-validation result.
The lagged absolute LLM surprise enters with a coefficient of $`0.349`$, significant at the 1% level, and raises $`R^2`$ from $`0.316`$ to $`0.426`$ relative to the specification with regimes, VIX, and recessions.
A measure constructed only from Fed documents therefore predicts an independent survey-based measure of policy uncertainty one meeting ahead.
This is the load-bearing validation result: when the LLM finds a decision surprising, professional forecasters subsequently disagree more about the near-term rate path.

The regime coefficients are suggestive rather than conclusive.
The 2000–04 coefficient, the cleanest post-1990s test outside the zero-lower-bound confound, is statistically indistinguishable from zero.
The 2011–18 coefficient is large and negative across specifications, but much of that period coincides with a pinned policy rate, which mechanically compresses disagreement about short rates.
Accordingly, the transparency-regime evidence should be read as consistent with the interpretation, not as the main identification result.

<figure id="fig:ce-dispersion-residual" data-latex-placement="t!">
<embed src="figures/validation/consensus_economics/v30.5/deepseek-v3.1:671b-cloud/ce_dispersion_residual_light.pdf" />
<figcaption><em>Note:</em> CE 3M rate cross-forecaster standard deviation after partialing out the CBOE VIX and an NBER recession indicator (3-meeting centred rolling mean for readability). Positive values indicate above-trend disagreement; negative values, below-trend. Background bands mark the same Fed transparency regimes used in Figure <a href="#fig:fff-llm-timeseries" data-reference-type="ref" data-reference="fig:fff-llm-timeseries">18</a> (grey 1990s, green 2000–04, amber 2005–10, blue 2011–18, purple 2019+); the residual is systematically positive in the early-transparency eras and negative through the 2011–18 ZLB period. Grey vertical bands indicate NBER recessions. The 2011–18 below-trend cluster reflects mechanical compression of forecast disagreement at the zero lower bound and should not be read as a clean transparency effect.</figcaption>
</figure>

### Decoder-Output Sanity Checks (Beige Book Scores)

A minimal validation question for any decoder is whether its output behaves like a real economic signal: does it co-move sensibly across dimensions, and does it align with conventional macroeconomic indicators?
Among the four decoders, only the Beige Book Decoder produces continuous, real-valued scores ($`b_{\text{inf}}`$, $`b_{\text{emp}}`$, $`b_{\text{agg}}`$) amenable to such tests; the Statement, Press Conference, and Minutes Decoders produce categorical or sparse outputs validated through the look-ahead and cross-model checks of Section <a href="#appendix:validation-robustness" data-reference-type="ref" data-reference="appendix:validation-robustness">9.1</a>.
This subsubsection reports three checks on the Beige Book scores: internal co-movement of the two mandate scores, external linkage of the aggregate score to standard macro indicators, and a sharper external test on the auxiliary topics the decoder discovers unsupervised.

#### Dual-mandate co-movement.

The mandate scores $`b_{\text{inf}}`$ and $`b_{\text{emp}}`$ are extracted independently but should not be informationally independent: most of the time the U.S. economy is uniformly hot or uniformly cold, and the Beige Book narratives reflect that.
Figure <a href="#fig:dual-mandate-plane" data-reference-type="ref" data-reference="fig:dual-mandate-plane">24</a> plots one point per meeting in the $`(b_{\text{inf}}, b_{\text{emp}})`$ plane.
The two scores co-move strongly ($`\mathrm{corr} = 0.79`$, $`N = 241`$), and 88% of meetings lie on the unambiguous-hike (NE, 72%) or unambiguous-cut (SW, 16%) diagonal.
The remaining 12% are economically interpretable: the small SE cluster (hot inflation, weak labor) coincides with late-cycle stagflation-flavored episodes, and the NW points (cool inflation, tight labor) with late-1990s low-inflation expansions and the post-pandemic reopening before 2022.
This co-movement is the methodological reason the weighted aggregate $`b_{\text{agg}}`$ is a sufficient summary in most predictive regressions, while the off-diagonal points are precisely those that load on the inflation-employment interaction in Panel B of Table <a href="#tab:beige_book_progressive" data-reference-type="ref" data-reference="tab:beige_book_progressive">[tab:beige_book_progressive]</a>.
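The quadrant bookkeeping behind these shares is a simple sign classification on the two mandate scores. A sketch with illustrative score pairs (thresholding at exactly zero is an assumption here; the paper's tie-breaking convention is not stated):

```python
def quadrant(b_inf, b_emp):
    """Classify one meeting in the (b_inf, b_emp) dual-mandate plane."""
    if b_inf >= 0 and b_emp >= 0:
        return "NE"   # uniformly hot: unambiguous-hike diagonal
    if b_inf < 0 and b_emp < 0:
        return "SW"   # uniformly cold: unambiguous-cut diagonal
    if b_inf >= 0:
        return "SE"   # hot inflation, weak labor: stagflation-flavored
    return "NW"       # cool inflation, tight labor

# Illustrative (b_inf, b_emp) pairs, one per meeting.
scores = [(0.6, 0.4), (-0.5, -0.3), (0.2, -0.1), (0.7, 0.5)]
diag_share = sum(quadrant(i, e) in ("NE", "SW") for i, e in scores) / len(scores)
```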

<figure id="fig:dual-mandate-plane" data-latex-placement="t!">
<embed src="figures/beige_book/properties/v30.1/deepseek-v3.1:671b-cloud/dual_mandate_plane_light.pdf" style="width:72.0%" />
<figcaption><em>Note:</em> One point per meeting in the <span class="math inline">(<em>b</em><sub>inf</sub>, <em>b</em><sub>emp</sub>) ∈ [−1, 1]<sup>2</sup></span> plane, color-coded by year. Concentric circles mark constant total mandate pressure <span class="math inline">$r = \sqrt{b_{\text{inf}}^2 + b_{\text{emp}}^2}$</span> at <span class="math inline"><em>r</em> ∈ {0.25, 0.5, 0.75, 1.0}</span>. Quadrant labels indicate the implied policy regime. Sample: 241 meetings (1996–2026), v30.1 deepseek-v3.1.</figcaption>
</figure>

#### Macro linkage of the aggregate score.

The aggregate score correlates strongly and with the expected sign with conventional macroeconomic indicators (Figure <a href="#fig:beige-book-macro-linkage-appendix" data-reference-type="ref" data-reference="fig:beige-book-macro-linkage-appendix">25</a>): forward-filled to monthly frequency, it moves positively with real GDP growth ($`\rho = +0.65`$) and PCE growth ($`\rho = +0.56`$), positively with CPI inflation ($`\rho = +0.49`$), and negatively with the unemployment rate ($`\rho = -0.58`$).[^26]
The CPI link is somewhat weaker than the others, as expected: the decoder measures inflation *pressures* from regional narratives rather than realized headline price changes.
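The forward-fill step that aligns the roughly eight-per-year Beige Book scores to monthly macro data can be sketched as follows, with synthetic scores and a synthetic co-moving indicator standing in for the real series:

```python
import numpy as np

def forward_fill_monthly(release_months, scores, n_months):
    """Carry each Beige Book score forward until the next release."""
    out = np.full(n_months, np.nan)
    for m, s in zip(release_months, scores):
        out[m:] = s                 # later releases overwrite earlier ones
    return out

# Illustrative: 8 releases mapped onto a 24-month grid.
release_months = [0, 3, 6, 9, 12, 15, 18, 21]
scores = [0.2, 0.3, 0.1, -0.1, -0.4, -0.2, 0.0, 0.3]
b_agg = forward_fill_monthly(release_months, scores, 24)

rng = np.random.default_rng(3)
gdp_growth = 2.0 + 3.0 * b_agg + rng.normal(0, 0.5, 24)  # co-moving indicator
rho = np.corrcoef(b_agg, gdp_growth)[0, 1]               # linkage correlation
```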

<figure id="fig:beige-book-macro-linkage-appendix" data-latex-placement="t!">
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/macro_linkages_aggregate_light.pdf" />
<figcaption><em>Note:</em> Each panel pairs the Beige Book aggregate score <span class="math inline"><em>b</em><sub>agg</sub></span> with one canonical macroeconomic indicator at monthly frequency. BB scores are forward-filled between releases. Correlations are reported in the body; sample is 1996–2026, <span class="math inline"><em>N</em> = 348</span> monthly observations, v30.1 deepseek-v3.1.</figcaption>
</figure>

#### Auxiliary topics track their natural macro counterparts.

The linkages above test the dimensions the LLM was *prompted* to extract.
A sharper test is whether the auxiliary topics, which the decoder *discovers* unsupervised, line up with the macro indicators a human would naturally pair them with.
Figure <a href="#fig:aux-macro-linkages" data-reference-type="ref" data-reference="fig:aux-macro-linkages">26</a> pairs the four most-frequent auxiliary topics with one canonical counterpart each: consumer demand with real PCE growth, manufacturing with industrial production, credit conditions with the Gilchrist–Zakrajšek excess bond premium, and housing activity with housing starts.
All four correlations have the expected sign and magnitudes in the same range as the dual-mandate linkages above ($`|\rho|`$ between $`0.38`$ and $`0.56`$): the credit-conditions correlation is negative because high BB credit-conditions scores indicate healthy credit (pro-tightening), whereas a high excess bond premium indicates credit stress (pro-easing).
The decoder is therefore well-calibrated not only on the dimensions the prompt requested, but also on topics the LLM identifies as policy-relevant from the Beige Book text alone.

<figure id="fig:aux-macro-linkages" data-latex-placement="t!">
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_macro_linkages_light.pdf" />
<figcaption><em>Note:</em> Each panel pairs one LLM-discovered auxiliary topic score with one canonical macro counterpart at monthly frequency. Correlations: consumer demand vs. real PCE growth <span class="math inline"><em>ρ</em> = +0.41</span> (<span class="math inline"><em>N</em> = 348</span>); manufacturing vs. industrial production YoY <span class="math inline"><em>ρ</em> = +0.38</span> (<span class="math inline"><em>N</em> = 293</span>); credit conditions vs. excess bond premium <span class="math inline"><em>ρ</em> = −0.47</span> (<span class="math inline"><em>N</em> = 348</span>); housing activity vs. housing starts YoY <span class="math inline"><em>ρ</em> = +0.56</span> (<span class="math inline"><em>N</em> = 362</span>). Sample: 1996–2026, v30.1 deepseek-v3.1.</figcaption>
</figure>

## Architectural Validation: Beige Book Regional Aggregation

Of the four document decoders, only the Beige Book introduces a non-trivial architectural choice: the raw document is split along its twelve Federal Reserve district sections, each section is processed independently by a district-level decoder, and the twelve outputs are recombined by a national orchestrator using GDP-share weights and time-varying salience weights. This subsection verifies that those design choices do not introduce identification problems beyond what the GDP-weighted national aggregate captures. Section <a href="#subsec:bb_weights_aux" data-reference-type="ref" data-reference="subsec:bb_weights_aux">9.3.1</a> reports the weight dynamics and auxiliary topic scores. Sections <a href="#subsec:district_identification" data-reference-type="ref" data-reference="subsec:district_identification">9.3.2</a>–<a href="#subsec:state_dependent_mandates" data-reference-type="ref" data-reference="subsec:state_dependent_mandates">9.3.5</a> run formal identification checks on the regional decomposition.
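The recombination step itself is a weighted average with renormalization. A minimal sketch, with four stand-in districts instead of twelve and illustrative weights (the pipeline uses the actual district decoder outputs, GDP shares, and time-varying salience weights):

```python
import numpy as np

district_scores = np.array([0.4, 0.1, -0.2, 0.3])   # district-level scores
gdp_share = np.array([0.35, 0.25, 0.25, 0.15])      # fixed GDP-share weights
salience = np.array([1.2, 1.0, 0.8, 1.0])           # time-varying salience

w = gdp_share * salience
w = w / w.sum()                   # renormalize so the weights sum to one
b_national = float(w @ district_scores)   # national orchestrator output
```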

### Weight Dynamics and Auxiliary Topic Scores

The decoder reweights its inputs meeting by meeting and discovers auxiliary topics beyond the dual mandate. This subsection documents both diagnostic outputs of that architecture: the weight allocation across topics over time, and the scored series for every auxiliary topic the decoder surfaced from the corpus.

<figure id="fig:beige-book-weights" data-latex-placement="t!">
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/weight_dynamics_light.pdf" />
<figcaption><em>Note:</em> Stacked area showing the decoder’s weight allocation across economic variables at each FOMC meeting date (Equation <a href="#eq:bb-aggregate" data-reference-type="ref" data-reference="eq:bb-aggregate">[eq:bb-aggregate]</a>). Inflation and employment jointly account for 85–95% of the total weight, reflecting the Federal Reserve’s dual mandate. The remaining weight is distributed across auxiliary topics discovered from each Beige Book’s content (consumer demand, housing, manufacturing, credit conditions). All weights sum to 1.0 at each meeting.</figcaption>
</figure>

Inflation and employment jointly absorb 85–95% of the decoder’s weight in every meeting, consistent with the dual mandate (Figure <a href="#fig:beige-book-weights" data-reference-type="ref" data-reference="fig:beige-book-weights">27</a>). The remainder reallocates with the macro state: inflation weight rises through 2021–2022 as price pressures dominate the narrative, and credit conditions and employment gain weight during the 2008 crisis. The reallocation is informative for prediction: in an expanding-window forecasting exercise, the LLM-assigned weights $`\omega^j_t`$ outperform equal weights by 4.9 percentage points in $`R^2`$.

<figure data-latex-placement="t!">
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_consumer_demand_light.pdf" />
<figcaption>Consumer demand</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_credit_conditions_light.pdf" />
<figcaption>Credit conditions</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_housing_light.pdf" />
<figcaption>Housing</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_manufacturing_light.pdf" />
<figcaption>Manufacturing</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_business_investment_light.pdf" />
<figcaption>Business investment</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_energy_light.pdf" />
<figcaption>Energy</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_commercial_real_estate_light.pdf" />
<figcaption>Commercial real estate</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_agriculture_light.pdf" />
<figcaption>Agriculture</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_supply_chains_light.pdf" />
<figcaption>Supply chains</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_trade_light.pdf" />
<figcaption>Trade</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_services_light.pdf" />
<figcaption>Services</figcaption>
</figure>
<figure>
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/aux_policy_uncertainty_light.pdf" />
<figcaption>Policy uncertainty</figcaption>
</figure>
<figcaption><em>Note:</em> Scores for all auxiliary topics discovered by the decoder, ordered by sample coverage. Positive values indicate hawkish-leaning conditions; negative values indicate dovish-leaning conditions. Consumer demand, credit conditions, housing, and manufacturing appear in <span class="math inline">≥</span>98% of meetings; business investment and energy in <span class="math inline">≈</span>50%; the remaining topics surface intermittently as economic narratives shift (commercial real estate, agriculture during commodity-price episodes, supply chains during 2021–2022, trade during tariff episodes, services and policy uncertainty during recent regimes). For sparsely sampled topics, observations are shown as discrete markers (no interpolating line). Recession periods are shaded.</figcaption>
</figure>

The auxiliary topics fall cleanly into two regimes (Figure <a href="#fig:auxiliary-scores" data-reference-type="ref" data-reference="fig:auxiliary-scores">[fig:auxiliary-scores]</a>). Four — consumer demand, credit conditions, housing, manufacturing — appear in nearly every meeting and track the dual-mandate cycle. The remainder surface only when the underlying narrative warrants: agriculture during commodity-price episodes, supply chains during 2021–2022, trade during tariff episodes. The decoder is therefore not a fixed taxonomy applied uniformly but a sparse, narrative-conditioned vocabulary.

<figure id="fig:topic_activation_bubbles" data-latex-placement="t!">
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/topic_activation_bubbles_light.pdf" />
<figcaption><em>Note:</em> Per-meeting topic activation across the LLM’s decoded narrative. Bubble size and ribbon thickness scale with the LLM-assigned weight <span class="math inline"><em>ω</em><sub><em>t</em></sub><sup><em>j</em></sup></span>; colour encodes the topic score (red: hawkish/tightening, blue: dovish/easing). Inflation and employment activate every meeting; auxiliary topics activate selectively (supply chains 2021–22, energy across the cycle), confirming the decoder’s narrative-conditioned vocabulary. <span class="math inline"><em>N</em> = 241</span> meetings.</figcaption>
</figure>

### District-Level Identification Checks

Two features of the regional architecture could in principle bias the aggregate: geographic dispersion across the twelve districts, and the rotating voting structure of the FOMC. I find no evidence that either does. The construction is the district-level analogue of the national score (Section <a href="#sec:bb-decoder" data-reference-type="ref" data-reference="sec:bb-decoder">2.5</a>): for each district $`d`$ at meeting $`t`$, the decoder produces a policy probability simplex over (tighten, neutral, ease), and the net mandate signal $`b_{d,t} = p_{\text{tighten},d,t} - p_{\text{ease},d,t} \in [-1,1]`$ is positive when local conditions lean hawkish; the GDP-weighted national aggregate is $`\bar{b}_t = \sum_d \omega_d\, b_{d,t}`$. I run two null tests, each exploiting institutional structure plausibly exogenous to any individual Beige Book.
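The district-level construction can be sketched directly from these definitions; the simplexes and GDP shares below are illustrative, not the paper's data:

```python
import numpy as np

# Each district's decoder output is a probability simplex over
# (tighten, neutral, ease); the net mandate signal is
# b_d = p_tighten - p_ease, and the national aggregate is the
# GDP-weighted mean of the district signals.
simplexes = {                    # district: (p_tighten, p_neutral, p_ease)
    "New York": (0.60, 0.30, 0.10),
    "Chicago":  (0.40, 0.40, 0.20),
    "Dallas":   (0.30, 0.50, 0.20),
}
gdp_share = {"New York": 0.5, "Chicago": 0.3, "Dallas": 0.2}

def national_signal(simplexes, gdp_share):
    names = list(simplexes)
    b = np.array([simplexes[d][0] - simplexes[d][2] for d in names])
    w = np.array([gdp_share[d] for d in names])
    w = w / w.sum()              # normalize GDP shares to weights
    return float(w @ b), dict(zip(names, b))

b_bar, b_d = national_signal(simplexes, gdp_share)
# b_d: NY 0.5, Chicago 0.2, Dallas 0.1 -> b_bar = 0.25 + 0.06 + 0.02 = 0.33
```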

The first asks whether geographic dispersion attenuates the aggregate’s predictive content: if the FOMC responds more strongly to broad-based than to concentrated signals, the national mean would understate predictive content in dispersed meetings. Define the district *coherence index* as the GDP-weighted cross-district standard deviation of net signals, $`\sigma_t = (\sum_d \omega_d (b_{d,t} - \bar{b}_t)^2)^{1/2}`$, and augment the baseline regression with $`\sigma_t`$ and its interaction with $`\bar{b}_t`$:
``` math
\begin{equation}
  \Delta i_t = \alpha + \beta\,\bar{b}_t + \gamma\,\sigma_t + \delta\,(\bar{b}_t \times \sigma_t) + \varepsilon_t.
  \label{eq:coherence-interaction}
\end{equation}
```
Both $`\hat\gamma`$ and $`\hat\delta`$ are far from significance at any conventional level, and $`R^2`$ is unchanged: within this specification, I find no evidence that cross-district dispersion moderates the mapping from the national Beige Book aggregate to policy decisions.
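The mechanics of the test can be reproduced on synthetic data: simulate the null ($`\gamma = \delta = 0`$), fit Equation <a href="#eq:coherence-interaction" data-reference-type="ref" data-reference="eq:coherence-interaction">[eq:coherence-interaction]</a> by OLS, and check that the dispersion terms hover near zero. All numbers below are synthetic, not the paper's estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 229                                  # meetings, matching the sample size
b_bar = rng.uniform(-1, 1, n)            # national Beige Book signal
sigma = rng.uniform(0.0, 0.5, n)         # coherence index sigma_t
# Null data-generating process: dispersion does not moderate the mapping.
di = 0.4 * b_bar + rng.normal(0, 0.1, n)

# OLS for di = alpha + beta*b_bar + gamma*sigma + delta*(b_bar*sigma) + eps
X = np.column_stack([np.ones(n), b_bar, sigma, b_bar * sigma])
alpha, beta, gamma, delta = np.linalg.lstsq(X, di, rcond=None)[0]
# beta recovers ~0.4; gamma and delta stay near zero under the null.
```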

The second asks whether the voting rotation reweights signals. Reserve Bank presidents rotate through four voting slots on a fixed annual schedule,[^27] but all twelve presidents attend every meeting and participate in deliberations regardless of voting status. If voting-year signals carried more weight, a voter-minus-nonvoter gap should enter the regression. Constructing GDP-weighted average signals separately for the $`n_t^V`$ voting and the $`12 - n_t^V`$ non-voting districts and adding the difference $`\bar{b}_t^V - \bar{b}_t^{NV}`$ to the baseline yields a coefficient of $`0.006`$, statistically indistinguishable from zero, consistent with all districts being weighted equally in the Committee’s deliberations. Together, these results suggest that the GDP-weighted aggregate summarizes district-level information efficiently for the purpose of predicting the policy decision; they do not, however, imply that the underlying signals are themselves homogeneous, a question taken up in Sections <a href="#subsec:district_slopes" data-reference-type="ref" data-reference="subsec:district_slopes">9.3.3</a> and <a href="#subsec:big_district_gap" data-reference-type="ref" data-reference="subsec:big_district_gap">9.3.4</a>.

### District Heterogeneous Slopes

The GDP-weighted aggregate treats every district’s signal as equally informative per unit of economic size. Whether that restriction binds is an empirical question: I estimate district-specific slopes on the long panel with one observation per district-meeting pair,
``` math
\begin{equation}
  \Delta i_t = \alpha_d + \beta^\pi_d\, b^\pi_{d,t} + \beta^\ell_d\, b^\ell_{d,t} + \varepsilon_{d,t},
  \label{eq:district-slopes}
\end{equation}
```
with district fixed effects $`\alpha_d`$ and meeting-clustered standard errors. Figures <a href="#fig:district_slopes_infl" data-reference-type="ref" data-reference="fig:district_slopes_infl">29</a> and <a href="#fig:district_slopes_labor" data-reference-type="ref" data-reference="fig:district_slopes_labor">30</a> display the estimated $`\hat{\beta}^\pi_d`$ and $`\hat{\beta}^\ell_d`$, ordered top-to-bottom by the inflation slope.

<figure id="fig:district_slopes" data-latex-placement="t!">
<figure id="fig:district_slopes_infl">
<embed src="figures/beige_book/v30.5/deepseek-v3.1:671b-cloud/district_slopes_infl_light.pdf" />
<figcaption>Inflation mandate (<span class="math inline"><em>β̂</em><sub><em>d</em></sub><sup><em>π</em></sup></span>)</figcaption>
</figure>
<figure id="fig:district_slopes_labor">
<embed src="figures/beige_book/v30.5/deepseek-v3.1:671b-cloud/district_slopes_labor_light.pdf" />
<figcaption>Labor mandate (<span class="math inline"><em>β̂</em><sub><em>d</em></sub><sup><em>ℓ</em></sup></span>)</figcaption>
</figure>
<figcaption>District-level slopes by mandate. Each marker is a Federal Reserve district (GDP share in parentheses on the y-axis); bars are 95% confidence intervals. District ordering is consistent across panels (sorted by inflation slope). Dashed vertical line: pooled OLS estimate. A Cochran <span class="math inline"><em>Q</em></span> test fails to reject homogeneity within either mandate (<span class="math inline"><em>Q</em><sup><em>π</em></sup> = 2.80</span>, <span class="math inline"><em>Q</em><sup><em>ℓ</em></sup> = 2.88</span>, both <span class="math inline"><em>p</em> &gt; 0.99</span> on <span class="math inline">11</span> d.f.; <span class="math inline"><em>I</em><sup>2</sup> = 0%</span>). Across mandates, however, the rank correlation between <span class="math inline"><em>β̂</em><sub><em>d</em></sub><sup><em>π</em></sup></span> and <span class="math inline"><em>β̂</em><sub><em>d</em></sub><sup><em>ℓ</em></sup></span> is strongly negative (Spearman <span class="math inline"><em>ρ</em> = −0.76</span>, <span class="math inline"><em>p</em> = 0.005</span>): districts that are inflation-loaded tend to be relatively labor-light, and vice versa.</figcaption>
</figure>

The visual scatter of point estimates is not statistically decisive. A Cochran $`Q`$ test fails to reject homogeneity within either mandate at any conventional level, with $`I^2 = 0\%`$ for both, but with confidence intervals this wide the appropriate conclusion is limited: the data do not support estimating district-specific slope magnitudes. The cross-mandate ordering, by contrast, is statistically meaningful: the Spearman rank correlation between $`\hat\beta^\pi_d`$ and $`\hat\beta^\ell_d`$ is strongly negative and statistically significant ($`\rho = -0.76`$), so districts that load relatively heavily on inflation load relatively lightly on labor. This pattern is consistent with regional economic specialization, but it lives in the relative composition of mandates within a district, not in absolute slope magnitudes; district slopes are themselves uncorrelated with GDP weights for both dimensions, ruling out the alternative that larger districts are systematically more informative per dollar of output. The composition is, in any case, not exploitable for prediction: estimating district-specific slopes on the first 137 meetings and applying them to construct a $`\hat{\beta}`$-weighted aggregate on the held-out 92 meetings (a 60/40 chronological split, $`N=229`$) yields a hold-out $`R^2`$ of $`0.449`$, statistically indistinguishable from $`0.450`$ under GDP weighting. The pooled GDP-weighted aggregate is, in this sense, a well-specified aggregator.
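The homogeneity statistics in the figure note are standard meta-analytic quantities; a self-contained sketch of the Cochran $`Q`$ and $`I^2`$ computation, with illustrative inputs rather than the estimated district slopes:

```python
import numpy as np

def cochran_q(betas, ses):
    """Cochran Q heterogeneity test across slope estimates.

    Q = sum_d w_d (beta_d - beta_bar)^2 with inverse-variance weights
    w_d = 1/se_d^2; under homogeneity Q ~ chi2(k-1).
    I^2 = max(0, (Q - df)/Q) measures the share of variation beyond
    sampling noise.
    """
    betas, ses = np.asarray(betas, float), np.asarray(ses, float)
    w = 1.0 / ses**2
    beta_bar = (w * betas).sum() / w.sum()
    q = float((w * (betas - beta_bar) ** 2).sum())
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

# Twelve identical slopes measured with wide confidence intervals:
# Q stays far below its degrees of freedom and I^2 clamps to zero,
# the qualitative pattern the paper reports.
betas = [0.30] * 12
ses = [0.15] * 12
q, i2 = cochran_q(betas, ses)
```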

### Big-District Disagreement as a Signal-Quality Indicator

Even when the aggregate is well-specified on average, episodes in which large and small districts send conflicting signals might flag meetings where the national mean obscures more than it reveals. To test this, I split the twelve districts into a “big-6” group (top six by average GDP weight: New York, San Francisco, Chicago, Atlanta, Dallas, Boston, covering approximately 64% of sample GDP) and a “small-6” group (Philadelphia, Cleveland, Richmond, Kansas City, Minneapolis, St. Louis). At each meeting $`t`$, I take GDP-weighted averages within each group across both mandate dimensions and define the big-minus-small gap as
``` math
\begin{equation}
  \Delta_t = \frac{1}{2}\Bigl[
    \bigl(\bar{b}^\pi_{\text{big},t} - \bar{b}^\pi_{\text{small},t}\bigr)
    + \bigl(\bar{b}^\ell_{\text{big},t} - \bar{b}^\ell_{\text{small},t}\bigr)
  \Bigr],
  \label{eq:big-gap}
\end{equation}
```
positive when large districts signal more tightening pressure than small ones. The gap is small in ordinary times — median absolute value roughly $`0.06`$ on the $`[-1,+1]`$ scale — but spikes to $`|\Delta_t| \geq 0.25`$ at every macro turning point in the sample: the dot-com contraction (2001–2003), the financial crisis (2007–2008), the COVID shock (six consecutive 2020 meetings), and the 2023–2025 tightening cycle. Figure <a href="#fig:district_big_gap" data-reference-type="ref" data-reference="fig:district_big_gap">32</a> plots the full series.
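Equation <a href="#eq:big-gap" data-reference-type="ref" data-reference="eq:big-gap">[eq:big-gap]</a> can be sketched as follows; the district signals are illustrative, and the GDP shares merely mimic the 64/36 big/small split:

```python
import numpy as np

def big_small_gap(b_pi, b_ell, gdp, big_mask):
    """Mean of the big-minus-small gaps in the inflation and labor
    signals, each group GDP-weighted internally (Equation [eq:big-gap])."""
    b_pi, b_ell, gdp = (np.asarray(x, float) for x in (b_pi, b_ell, gdp))
    big_mask = np.asarray(big_mask, bool)
    def gmean(x, m):                       # within-group GDP-weighted mean
        w = gdp[m] / gdp[m].sum()
        return float(w @ x[m])
    gap_pi = gmean(b_pi, big_mask) - gmean(b_pi, ~big_mask)
    gap_ell = gmean(b_ell, big_mask) - gmean(b_ell, ~big_mask)
    return 0.5 * (gap_pi + gap_ell)

# Stylized turning point: the big-6 lean hawkish (+0.4) on both mandates
# while the small-6 read neutral, so Delta_t = 0.4.
big = [True] * 6 + [False] * 6
b_pi = [0.4] * 6 + [0.0] * 6
b_ell = [0.4] * 6 + [0.0] * 6
gdp = [0.12, 0.11, 0.11, 0.10, 0.10, 0.10,
       0.06, 0.06, 0.06, 0.06, 0.06, 0.06]
delta_t = big_small_gap(b_pi, b_ell, gdp, big)
```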

<figure id="fig:district_big_gap" data-latex-placement="t!">
<embed src="figures/beige_book/v30.5/deepseek-v3.1:671b-cloud/district_big_gap_light.pdf" />
<figcaption>Big-minus-small district gap <span class="math inline"><em>Δ</em><sub><em>t</em></sub></span> over time. Positive values indicate large districts (<span class="math inline">≈</span>64% of GDP: NYC, SFR, CHI, ATL, DAL, BOS) signal more policy tightening than small districts. Dashed lines mark the <span class="math inline">±</span>75th percentile. Gray shading: NBER recessions.</figcaption>
</figure>

Splitting meetings at the median $`|\Delta_t|`$ reveals a modest predictive-accuracy gradient: the sample correlation between $`\bar{b}_t`$ and $`\Delta i_t`$ is $`r = 0.388`$ in low-gap meetings and $`r = 0.365`$ in high-gap meetings. Table <a href="#tab:big_gap_regime" data-reference-type="ref" data-reference="tab:big_gap_regime">[tab:big_gap_regime]</a> formalizes the comparison with two interaction specifications, $`|\Delta_t|`$ entered directly (column 1) and a binary above-median indicator (column 2). Both interaction coefficients are positive but imprecisely estimated; with 229 meetings, the test lacks power against a noisy moderator. The regression does, however, identify a level effect: meetings with strong big–small disagreement see smaller average rate changes, suggesting the FOMC acts more cautiously when its regional intelligence is internally inconsistent.

In high-gap meetings the small-6 aggregate is marginally more predictive ($`r = 0.375`$) than the big-6 ($`r = 0.346`$). One possible interpretation is compositional: the largest districts are more exposed to finance- and technology-sensitive conditions, whereas several smaller districts sit closer to manufacturing, agriculture, and energy. Given the imprecision of the interaction estimates, that mechanism should be read as suggestive rather than established. Because neither group dominates consistently enough to displace the national aggregate, $`\Delta_t`$ is not a standalone predictor; the safer interpretation is descriptive rather than structural. A large $`|\Delta_t|`$ serves as a caution indicator, marking meetings in which the Beige Book aggregate is more likely to compress cross-regional disagreement and should therefore be read alongside hard data or statement language.

<figure id="fig:district_mandate_conflict" data-latex-placement="t!">
<embed src="figures/beige_book/properties/v30.5/deepseek-v3.1:671b-cloud/district_mandate_conflict_light.pdf" />
<figcaption><em>Note:</em> District-level mandate-priority heatmap: <span class="math inline">(<em>p</em><sub><em>T</em></sub> − <em>p</em><sub><em>E</em></sub>)<sub>infl</sub> − (<em>p</em><sub><em>T</em></sub> − <em>p</em><sub><em>E</em></sub>)<sub>emp</sub></span> per district per meeting, 3-meeting rolling smooth. Red bands mark periods when the district’s narrative is dominated by inflation pressure; blue bands mark periods when labor-market weakness dominates. Districts ordered by approximate GDP weight; NBER recessions hatched. The heatmap complements the big-minus-small aggregate (Figure <a href="#fig:district_big_gap" data-reference-type="ref" data-reference="fig:district_big_gap">32</a>) by showing <em>which</em> districts disagree at any given moment.</figcaption>
</figure>

### State-Dependent Mandate Weights

The baseline specifications impose constant coefficients on the inflation and labor signals. Clarida et al. (1999) predicts otherwise: the Fed should shift attention toward whichever objective is further from target. To test this, I define two binary indicators,
``` math
\begin{align*}
  D^\pi_t &= \mathbf{1}\bigl[\pi_t > 2\%\bigr], \quad \pi_t = \text{CPI (year-over-year)} \\
  D^\ell_t &= \mathbf{1}\bigl[U_t > \tilde{U}_t\bigr], \quad \tilde{U}_t = \text{HP-filtered unemployment trend},
\end{align*}
```
and estimate the augmented regression
``` math
\begin{equation}
  \Delta i_t = \alpha + \beta_\pi b^\pi_t + \beta_\ell b^\ell_t
               + \gamma_\pi D^\pi_t + \varphi_\pi \bigl(b^\pi_t \times D^\pi_t\bigr)
               + \gamma_\ell D^\ell_t + \varphi_\ell \bigl(b^\ell_t \times D^\ell_t\bigr)
               + \varepsilon_t,
  \label{eq:state_dependent}
\end{equation}
```
where $`b^\pi_t`$ and $`b^\ell_t`$ are the GDP-weighted cross-district Beige Book signals.[^28] The effective weight on the inflation signal in regime $`D^\pi_t = 1`$ is $`\beta_\pi + \varphi_\pi`$, and analogously for labor.
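The interaction design in Equation <a href="#eq:state_dependent" data-reference-type="ref" data-reference="eq:state_dependent">[eq:state_dependent]</a> can be checked mechanically on synthetic data. The sketch below simulates a reaction function whose inflation weight jumps in the high-inflation regime and recovers it by OLS; the state variables, thresholds, and coefficients are illustrative stand-ins, not the paper's real-time CPI or HP-filtered unemployment series:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 240
b_pi, b_ell = rng.uniform(-1, 1, (2, n))   # Beige Book mandate signals
cpi = rng.uniform(0.5, 5.0, n)             # stand-in state variables
u_gap = rng.normal(0.0, 1.0, n)
D_pi = (cpi > 2.0).astype(float)           # high-inflation regime
D_ell = (u_gap > 0.0).astype(float)        # above-trend unemployment

# Simulated reaction function: the weight on b_pi jumps from 0.17 to
# 0.57 in the high-inflation regime (phi_pi = 0.4); labor weight constant.
di = 0.17 * b_pi + 0.4 * D_pi * b_pi + 0.3 * b_ell + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), b_pi, b_ell,
                     D_pi, b_pi * D_pi, D_ell, b_ell * D_ell])
coef = np.linalg.lstsq(X, di, rcond=None)[0]
beta_pi, phi_pi = coef[1], coef[4]
effective_high = beta_pi + phi_pi          # weight on b_pi when D_pi = 1
```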

The full model (Table <a href="#tab:state_dependent_mandates" data-reference-type="ref" data-reference="tab:state_dependent_mandates">[tab:state_dependent_mandates]</a>, column 4) yields two findings of opposite sign. The inflation interaction $`\hat\varphi_\pi = 0.418`$, significant at the 1% level, implies the effective weight on $`b^\pi_t`$ rises from $`0.171`$ when CPI is below 2% to $`0.589`$ when it exceeds 2%, a 3.4-fold increase consistent with a convex-loss reaction function: each unit of inflationary signal maps to a larger policy response when price stability is already stressed. This pattern is not confined to 2021–2022: CPI has been above 2% for most of the post-1996 sample outside 2008–2015, and the high-inflation regime covers 65% of all meetings. The labor interaction cuts in the opposite direction. With $`\hat\varphi_\ell = -0.248`$, marginally significant at the 10% level, $`b^\ell_t`$ *loses* predictive content when unemployment runs above trend — the opposite of the symmetric mandate-switching prediction.

One possible explanation is mechanical: when labor-market slack is greatest, the zero lower bound may compress observed rate changes and mechanically weaken the labor interaction. The robustness check does not support that explanation. Re-estimating column 4 on the sample that excludes both ZLB episodes (December 2008–December 2015 and March 2020–March 2022, $`N=188`$) leaves the labor interaction essentially unchanged at $`\hat\varphi_\ell = -0.292`$, if anything slightly more negative than the full-sample estimate. The inflation interaction survives the same exclusion ($`\hat\varphi_\pi = +0.341`$, less precise on the smaller sample) and attenuates to zero within the ZLB sample itself ($`\hat\varphi_\pi = +0.064`$, $`N = 83`$), where rates cannot move regardless of the inflation signal. The labor interaction behaves differently: it is driven by non-ZLB labor-slack episodes, chiefly the 2001–2003 recovery and the 2024–2025 cooling. The implication is straightforward. The Fed’s reaction function appears asymmetric in inflation, but the symmetric mandate-switching prediction does not extend to labor: when employment is below target, the marginal Beige Book labor signal carries less weight, not more, in the rate decision. Together the two interactions raise $`R^2`$ from $`0.189`$ to $`0.333`$ ($`+14.4`$ pp), with the aggregate Beige Book signal significant in every specification.

### Aggregation Scheme and the Case for the Dual-Mandate Specification

The decoder assigns LLM-estimated policy-salience weights $`\omega^j_t`$ to each topic at each meeting (Section <a href="#sec:bb-decoder" data-reference-type="ref" data-reference="sec:bb-decoder">2.5</a>). Table <a href="#tab:bb_weighting_comparison" data-reference-type="ref" data-reference="tab:bb_weighting_comparison">[tab:bb_weighting_comparison]</a> compares this time-varying scheme to two simpler alternatives — equal weights across all topics, and constant OLS coefficients on inflation and employment alone — and provides the empirical justification for the dual-mandate specification used throughout the main text.

All three specifications are significant and carry near-identical in-sample fit ($`R^2 \approx 0.40`$), so the Beige Book signal is not an artifact of the specific weighting choice. The LLM-estimated salience weights outperform equal weights out of sample by 4.6 percentage points ($`0.454`$ vs. $`0.408`$), confirming that meeting-by-meeting reallocation toward the most policy-relevant topics carries genuine information rather than in-sample noise. The dual-mandate specification — dropping auxiliary topics and estimating constant OLS coefficients on inflation and employment alone — nonetheless achieves the highest OOS fit ($`0.463`$), edging out the LLM-weighted aggregate. The auxiliary-topic layer is therefore useful primarily as measurement and audit infrastructure: it shows what additional narratives the decoder detects and how salience shifts across meetings (Figure <a href="#fig:auxiliary-scores" data-reference-type="ref" data-reference="fig:auxiliary-scores">[fig:auxiliary-scores]</a>), even though those extra topics do not improve hold-out prediction once the final aggregate is estimated. This is the empirical reason the main text restricts attention to the dual-mandate specification: for inference, the parsimonious predictor is the more stable one.

### Temporal Stability

The full-sample coefficient $`\hat{\beta} = 0.32`$ (SE $`= 0.08`$) masks substantial time variation. Figure <a href="#fig:bb_rolling_beta" data-reference-type="ref" data-reference="fig:bb_rolling_beta">34</a> reports rolling 60-meeting OLS estimates of $`\hat{\beta}_t`$ from the baseline specification $`\Delta i_t = \alpha + \rho\,i_{t-1} + \beta\,\bar{b}_t + \varepsilon_t`$, with $`\pm 2`$ standard-error bands.

<figure id="fig:bb_rolling_beta" data-latex-placement="t!">
<embed src="figures/beige_book/v30.5/deepseek-v3.1:671b-cloud/bb_rolling_beta_light.pdf" />
<figcaption>Rolling Beige Book coefficient <span class="math inline"><em>β̂</em><sub><em>t</em></sub></span>. Each point is the OLS slope on the GDP-weighted Beige Book aggregate from a 60-meeting rolling window, with <span class="math inline">±2</span> standard-error bands. The dashed horizontal line is the full-sample estimate (<span class="math inline"><em>β̂</em> = 0.32</span>).</figcaption>
</figure>

The coefficient tracks macro regimes. During the zero-lower-bound period (roughly 2009–2015), $`\hat{\beta}_t`$ compresses toward zero — when the policy rate is constrained, no Beige Book signal can predict rate changes — mirroring the labor-coefficient finding in Section <a href="#subsec:state_dependent_mandates" data-reference-type="ref" data-reference="subsec:state_dependent_mandates">9.3.5</a>. As the Fed exits the lower bound and especially through the 2022–2023 tightening cycle, $`\hat{\beta}_t`$ rises sharply, peaking near $`0.92`$, almost three times the full-sample mean, as the Beige Book’s regional evidence on labor shortages and price pressures becomes the proximate driver of meeting-to-meeting decisions. The full-sample estimate of $`0.32`$ is therefore best read as the time-average of a state-dependent relationship, not as a structural constant.
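The rolling estimator behind the figure is a plain windowed OLS. A sketch on synthetic data with a constant true coefficient, so that any drift in the real series can be attributed to regime variation rather than the estimator; all numbers here are illustrative:

```python
import numpy as np

def rolling_beta(i_lag, b_bar, di, window=60):
    """OLS slope on b_bar from di = a + rho*i_lag + beta*b_bar + e,
    re-estimated on each `window`-meeting span."""
    betas = []
    for start in range(len(di) - window + 1):
        s = slice(start, start + window)
        X = np.column_stack([np.ones(window), i_lag[s], b_bar[s]])
        betas.append(np.linalg.lstsq(X, di[s], rcond=None)[0][2])
    return np.array(betas)

# Synthetic check: with a constant true beta = 0.32, every window's
# estimate clusters around 0.32.
rng = np.random.default_rng(2)
n = 200
i_lag = rng.normal(3.0, 1.5, n)
b_bar = rng.uniform(-1, 1, n)
di = 0.32 * b_bar - 0.05 * i_lag + rng.normal(0, 0.05, n)
betas = rolling_beta(i_lag, b_bar, di)
```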

# Filtration and Surprise-Measure Properties

This appendix characterises four properties of the filtration’s output. The document-only prior $`\mathcal{P}_4`$ is broadly well calibrated, with the realised outcome falling at or near the modal forecast in roughly 70% of meetings and accuracy concentrated in steady-state regimes (Section <a href="#sec:appendix_distributional" data-reference-type="ref" data-reference="sec:appendix_distributional">10.1</a>). The residual surprise inherits the equity-return loading of standard market-based shocks rather than being diffuse across predictors, and the LLM’s self-reported salience score adds no independent dimension beyond the prior’s concentration at the realised outcome. An optional news stage (Section <a href="#sec:appendix_p4_vs_p5" data-reference-type="ref" data-reference="sec:appendix_p4_vs_p5">10.2</a>) refines the prior in identifiable inter-meeting episodes without weakening forward-rate content.

## Distributional Analysis of the Filtration

I present the full distributional analysis of the $`\mathcal{P}_1 \to \mathcal{P}_4`$ Bayesian filtration summarized in Section <a href="#sec:results" data-reference-type="ref" data-reference="sec:results">5</a>. While the first moment of the distribution is a sufficient statistic for predicting rate changes (Table <a href="#tab:expectation_moments" data-reference-type="ref" data-reference="tab:expectation_moments">[tab:expectation_moments]</a>), the full probability distributions reveal which documents contribute what information and when the system succeeds or fails.

### $`\mathcal{P}_4`$ Probability Mass Over Time

Figure <a href="#fig:ee-forecasts" data-reference-type="ref" data-reference="fig:ee-forecasts">35</a> reports the full $`\mathcal{P}_4`$ distribution at every FOMC meeting in the sample. The stacked-area chart compresses the rate-change support into five buckets and traces how probability mass migrates across regimes, providing a visual complement to the moment-based analyses elsewhere in this appendix.

<figure id="fig:ee-forecasts" data-latex-placement="t!">
<embed src="figures/expectation_engine/properties/v30.5/deepseek-v3.1:671b-cloud/distribution_fan_light.pdf" />
<figcaption><em>Note:</em> Full probability distribution <span class="math inline">𝒫<sub>4</sub></span> over rate-change scenarios at each FOMC meeting date. The stacked areas show the probability mass assigned to each action: cuts (<span class="math inline">≥</span>50 bp in dark blue, 25 bp in light blue), hold (gray), and hikes (25 bp in light red, <span class="math inline">≥</span>50 bp in dark red). During the zero lower bound period (2008–2015), hold probability approaches 100%. The 2022–2023 tightening cycle shows dominant hike probabilities, with large hikes (<span class="math inline">≥</span>50 bp) concentrated in mid-2022. Regime transitions (2001 recession, 2007–08 crisis, 2015 liftoff, 2020 pandemic) exhibit rapid mass reallocation across scenarios.</figcaption>
</figure>

A complementary view of the same filtration is the distribution of update magnitudes $`|E[\Delta i_t \mid \mathcal{P}_j] - E[\Delta i_t \mid \mathcal{P}_k]|`$ at all stage pairs $`(j,k)`$ with $`j < k`$ (Figure <a href="#fig:filtration_info" data-reference-type="ref" data-reference="fig:filtration_info">36</a>), separating marginal (adjacent-stage) from cumulative updates.
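The update magnitudes follow directly from the stage-wise expectations; the numbers below describe one stylized meeting, not sample values:

```python
def update_magnitudes(expectations):
    """|E[di | P_j] - E[di | P_k]| for all stage pairs j < k, keyed by
    (j, k); adjacent pairs are the marginal updates, the rest cumulative."""
    stages = sorted(expectations)
    return {(j, k): abs(expectations[k] - expectations[j])
            for a, j in enumerate(stages) for k in stages[a + 1:]}

# Expected rate change (bp) after each document stage, P1 .. P4:
e = {1: 5.0, 2: 8.0, 3: 18.0, 4: 20.0}
u = update_magnitudes(e)
# Marginal updates: (1,2)=3, (2,3)=10, (3,4)=2.
# Cumulative (1,4)=15 is the total pre-meeting revision.
```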

<figure id="fig:filtration_info" data-latex-placement="t!">
<embed src="figures/beige_book/v30.5/deepseek-v3.1:671b-cloud/filtration_info_light.pdf" />
<figcaption>Prior update magnitudes <span class="math inline">|<em>E</em>[<em>Δ</em><em>i</em><sub><em>t</em></sub> ∣ 𝒫<sub><em>j</em></sub>] − <em>E</em>[<em>Δ</em><em>i</em><sub><em>t</em></sub> ∣ 𝒫<sub><em>k</em></sub>]|</span> at all stage pairs <span class="math inline">(<em>j</em>, <em>k</em>)</span>, <span class="math inline"><em>j</em> &lt; <em>k</em></span>. The left group shows marginal (adjacent-stage) updates; the right group shows cumulative updates spanning two or three stages. Each box uses the pairwise-complete sample; <span class="math inline"><em>n</em></span> is reported below each box. Diamonds mark the mean; dots are individual meetings. Among marginal updates the Beige Book produces the largest average revision, reflecting fresh real-economy information not present in the policy-focused earlier documents; the Press Conference produces the smallest. Cumulative updates are larger by construction, with <span class="math inline">𝒫<sub>1</sub> → 𝒫<sub>4</sub></span> capturing the total pre-meeting information content.</figcaption>
</figure>

### Filtration Accuracy

For each meeting $`t`$ and filtration stage $`k`$, I compute $`P_k(\Delta i_t^{\text{actual}})`$: the probability the Forecaster assigned to the outcome that actually occurred. This measures how inferrable the decision was from public documents available at stage $`k`$, not Fed transparency per se, since the Forecaster observes only Fed publications and not market data or private briefings.
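A minimal sketch of the accuracy measure, with an illustrative hold-meeting distribution (bucket labels and probabilities are made up):

```python
def stage_accuracy(dist, actual):
    """P_k(actual): probability the stage-k distribution assigned to the
    realised rate change (0.0 if the outcome was outside the support)."""
    return dist.get(actual, 0.0)

def modal_hit(dist, actual):
    """True when the realised outcome equals the modal forecast."""
    return max(dist, key=dist.get) == actual

# A confidently predicted hold meeting (rate changes in bp):
p4 = {-25: 0.08, 0: 0.70, 25: 0.22}
acc = stage_accuracy(p4, 0)       # 0.70
hit = modal_hit(p4, 0)            # True
```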

Table <a href="#tab:transparency_index" data-reference-type="ref" data-reference="tab:transparency_index">[tab:transparency_index]</a> reports summary statistics by stage. Mean accuracy is approximately 70% across all stages (range 0.701–0.719), with the realized outcome matching the modal forecast in 73–75% of meetings ($`P > 0.50`$ column). These headline numbers should be interpreted against the base rate: 70% of FOMC meetings result in no change, and an always-hold rule would achieve 70% modal accuracy. The informative comparisons are therefore conditional on the type of action.

Holds are highly predictable from public text (71% at $`\mathcal{P}_4`$), while large cuts ($`\geq`$50bp) are nearly impossible to anticipate (7%). This reflects the nature of large easing actions: they are overwhelmingly emergency or intermeeting responses to sudden shocks (January 2001, March 2020) that no pre-meeting document could foresee. Standard hikes (25bp) are well-predicted (65%), consistent with the Fed’s practice of telegraphing tightening cycles through forward guidance.

The most interesting pattern emerges from the stage comparison on the common sample ($`n = 182`$ meetings with both $`\mathcal{P}_1`$ and $`\mathcal{P}_4`$). The Beige Book raises accuracy for hikes by 7.4 percentage points ($`\mathcal{P}_4 > \mathcal{P}_1`$ in 84% of hike meetings) but slightly reduces it for holds ($`-`$3.4pp). The Beige Book’s district-level evidence on labor tightness and price pressures provides the confirming evidence needed to shift expectations toward tightening. For hold meetings, the same regional detail introduces two-sided risks that pull probability from an already-confident prediction.

<figure id="fig:accuracy_entropy_stacked" data-latex-placement="t!">
<embed src="figures/distributional/v30.5/deepseek-v3.1:671b-cloud/accuracy_entropy_stacked_light.pdf" />
<figcaption><em>Note:</em> (a) <span class="math inline">𝒫<sub>4</sub>(<em>Δ</em><em>i</em><sub><em>t</em></sub><sup>actual</sup>)</span> for each FOMC meeting (gray dots) with 12-meeting centered rolling median (black line). (b) Shannon entropy <span class="math inline"><em>H</em>(𝒫<sub>4</sub>)</span> at each meeting. High accuracy periods (ZLB, measured tightening) coincide with low entropy; transition periods show rising entropy as the distribution broadens to accommodate uncertainty about the direction and magnitude of policy changes. Gray shading: NBER recessions. Sample: 242 meetings (1996–2026).</figcaption>
</figure>

Figure <a href="#fig:accuracy_entropy_stacked" data-reference-type="ref" data-reference="fig:accuracy_entropy_stacked">37</a> plots $`\mathcal{P}_4(\Delta i_t^{\text{actual}})`$ and distributional entropy over time. Accuracy is highest during steady-state regimes: the zero lower bound period (2009–2015) approaches 95% as the system confidently predicts holds, and the measured tightening cycle (2004–2006) reaches 90% as the Fed’s “measured pace” language made 25bp hikes nearly certain. Accuracy is lowest during transitions: the 2001 easing cycle, the 2015–2019 normalization, and the 2022 rapid tightening all show sharp declines as the policy reaction function became harder to infer from text alone. Panel (b) shows a broadly mirror pattern in entropy, though magnitudes should be read with caution: $`\mathcal{P}_4`$ distributions often have only two or three support points, so $`H(\mathcal{P}_4)`$ is mechanically sensitive to support size and is best read as a directional indicator of uncertainty rather than a precise information-theoretic quantity.
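The entropy panel uses the standard Shannon measure; a short sketch that also illustrates the support-size caveat (distributions are illustrative):

```python
import numpy as np

def shannon_entropy(dist):
    """H(P) = -sum_x p(x) log2 p(x) over the support. With only two or
    three support points, H is mechanically capped at log2(k), which is
    why the text reads it as a directional uncertainty indicator."""
    p = np.array([v for v in dist.values() if v > 0.0])
    return float(-(p * np.log2(p)).sum())

confident = {0: 0.95, 25: 0.05}             # steady-state regime
uncertain = {-25: 0.25, 0: 0.50, 25: 0.25}  # transition regime
# H(confident) ~= 0.29 bits; H(uncertain) = 1.5 bits, near the
# log2(3) ~= 1.58 cap for a three-point support.
```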

### Residual Predictability of the Surprise

<span id="appendix:variance-decomposition" label="appendix:variance-decomposition"></span>

Table <a href="#tab:predictability_regressions" data-reference-type="ref" data-reference="tab:predictability_regressions">[tab:predictability_regressions]</a> identifies which macro-financial variables predict each surprise but not how much each contributes. I compute partial $`R^2`$ for each predictor as $`R^2_{\text{full}} - R^2_{\text{full minus that predictor}}`$.
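
The leave-one-out computation can be sketched as follows; the data are synthetic placeholders rather than the macro-financial predictors, and each partial $`R^2`$ is non-negative because the full regression nests every reduced one:

```python
import numpy as np

def r_squared(y, X):
    """In-sample OLS R^2, intercept included."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

def partial_r2(y, X):
    """R^2_full minus R^2 with each column dropped, one value per predictor."""
    full = r_squared(y, X)
    return np.array([full - r_squared(y, np.delete(X, j, axis=1))
                     for j in range(X.shape[1])])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)   # only column 0 matters here
pr2 = partial_r2(y, X)                     # column 0 dominates; others ~0
```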

The LLM surprise ($`R^2 = 0.166`$) and the market-based measures load materially on the same predictor: equity-return information. The S&P 500 contribution is 35.1% for the LLM, 34.6% for FF1, and 26.2% for both FF4 and ED1, declining to 17.4% only at the longer-dated ED4. Romer & Romer (2004) is the outlier: its predictability concentrates almost entirely in two yield-curve variables, with term spread (51.4%) and Treasury skewness (39.3%) together exceeding 90% and S&P 500 contributing only 2.8%. The narrative pipeline therefore behaves like a market-based measure in the predictor space it occupies, even though it never observes price data directly: Beige Books, Minutes, and Statements discuss equity and financial conditions qualitatively, and the system internalises those signals in a way that mirrors how market-based shocks load on realised returns.

Rolling window analysis across 176 sixty-meeting windows confirms this structure is stable over time, with all measures showing a gradual declining trend in overall predictability.[^29]

### Salience and Prior Concentration

The salience score $`\mu_t \in [0,1]`$, which the Surprise Extractor assigns alongside the point forecast, is almost entirely explained by the prior’s concentration at the realised outcome: a single regressor, $`p(\text{actual} \mid \mathcal{P}_4)`$, accounts for 77% of $`\mu_t`$’s variance with $`\text{corr}(\mu_t, p(\text{actual})) = -0.89`$. Salience is therefore not an independent dimension of the monetary policy event; it is a near-monotone restatement of how much prior probability the document-only filtration assigned to what actually happened. The pipeline uses only $`\hat{s}_t`$ for identification, so this is a property of $`\mu_t`$ rather than a constraint on the analysis, but it affects how the score should be interpreted.

Table <a href="#tab:salience_regression" data-reference-type="ref" data-reference="tab:salience_regression">[tab:salience_regression]</a> runs the full set of regressions of $`\mu_t`$ on properties of the $`\mathcal{P}_4`$ prior distribution, organised by conceptually distinct dimensions. Adding the surprise magnitude $`|\hat{s}_t|`$ raises $`R^2`$ only marginally to 0.78 (column 2), and $`|\hat{s}_t|`$ itself is not significant once $`p(\text{actual})`$ is included. Prior skewness contributes negligibly in column (3).

Columns (4) and (5) show that entropy and variance, alternative measures of distributional concentration, explain 40% and 37% individually. Both are significant univariate predictors but substantially weaker than $`p(\text{actual})`$, which directly captures how much probability mass the prior placed on what actually happened. When all five properties enter together (column 6), only $`p(\text{actual})`$ remains significant; the remaining predictors add less than one percentage point to $`R^2`$ beyond the parsimonious specification in column (1). The salience score is thus almost entirely a mechanical function of one quantity: the prior’s concentration at the realized outcome. It does not capture an independent dimension of the monetary policy event.

## Document-Only vs. Document-Plus-News Priors

<span id="subsec:p4_vs_p5" label="subsec:p4_vs_p5"></span>

Adding inter-meeting news reduces residual predictability without weakening forward-rate content. The $`\mathcal{P}_4 \to \mathcal{P}_5`$ comparison tests whether a broader information set improves expectation quality ex ante.

#### News extraction architecture.

The baseline prior $`\mathcal{P}_4`$ conditions on the four Fed documents described in Section <a href="#sec:methodology" data-reference-type="ref" data-reference="sec:methodology">2</a>: Statement, press conference, Minutes, and Beige Book. The extended prior $`\mathcal{P}_5`$ adds inter-meeting news arriving between the Beige Book release and the FOMC decision, a window of typically 10–14 days. FactSet StreetAccount coverage begins in mid-2003; for earlier meetings, $`\mathcal{P}_5 = \mathcal{P}_4`$.

I process articles through five domain modules in parallel. The *data releases* module records each indicator, the actual release, the consensus forecast, and the sign of the surprise. The *Fed communications* module records speeches and testimony, noting the speaker’s voting status and any deviation from the most recent Statement. The *macroeconomic outlook* module distills press narratives on growth, inflation, and labor market conditions. The *financial conditions* module flags credit tightening, bank stress, energy shocks, and geopolitical disruptions. The *market expectations* module records dealer commentary, futures-implied policy paths, and consensus revisions over the inter-meeting window.

A synthesizer consolidates the five module reports into a single net-policy-signal assessment. The Forecaster then updates $`\mathcal{P}_4 \to \mathcal{P}_5`$ using the same narrative-first architecture as stages $`\mathcal{P}_1`$–$`\mathcal{P}_4`$: qualitative reasoning first, then a revised probability distribution. This extension requires five additional model calls per meeting.
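
Structurally, the news stage can be sketched as below. The module names mirror the five domains; the function bodies are placeholders standing in for the underlying model calls, whose prompts the sketch does not reproduce:

```python
from concurrent.futures import ThreadPoolExecutor

MODULES = ["data_releases", "fed_communications", "macro_outlook",
           "financial_conditions", "market_expectations"]

def run_module(name, articles):
    """Placeholder for one domain-specific LLM call over the article window."""
    return {"module": name, "summary": f"{len(articles)} articles reviewed"}

def extract_news_signal(articles):
    """Run the five domain modules in parallel, then hand their reports to a
    synthesizer step that nets them into a single policy-signal assessment."""
    with ThreadPoolExecutor(max_workers=len(MODULES)) as pool:
        reports = list(pool.map(lambda m: run_module(m, articles), MODULES))
    net_signal = "placeholder for the synthesizer call"   # one extra call
    return {"reports": reports, "net_signal": net_signal}

signal = extract_news_signal(["Payrolls beat consensus", "Hawkish Fed speech"])
```

The five parallel module calls plus the synthesizer account for the five additional model calls per meeting noted above; the Forecaster’s $`\mathcal{P}_4 \to \mathcal{P}_5`$ update then consumes the synthesizer output.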

#### Accuracy by update magnitude.

Table <a href="#tab:p5_accuracy" data-reference-type="ref" data-reference="tab:p5_accuracy">[tab:p5_accuracy]</a> bins meetings by the size of the $`\mathcal{P}_4 \to \mathcal{P}_5`$ update and compares the mean absolute forecast errors of the two priors. The pattern is monotone: the larger the update, the larger the gain. On the heaviest revisions (above 20 bp), $`\mathcal{P}_5`$ moves toward the realized decision in nine cases out of ten and cuts mean absolute error to roughly a third of its $`\mathcal{P}_4`$ level. Smaller updates contribute correspondingly less, and on meetings where the synthesizer judges the inter-meeting signal too weak to act on, $`\mathcal{P}_5`$ inherits $`\mathcal{P}_4`$ verbatim.
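
A minimal sketch of the binning exercise, on synthetic data in which the news update closes most of the $`\mathcal{P}_4`$ forecast gap (all magnitudes illustrative, not the table’s values):

```python
import numpy as np

def binned_mae(update, err_p4, err_p5, edges):
    """Mean absolute forecast error of each prior within bins of |update|."""
    rows, a = [], np.abs(update)
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (a >= lo) & (a < hi)
        if m.any():
            rows.append({"bin": (lo, hi), "n": int(m.sum()),
                         "mae_p4": float(np.abs(err_p4[m]).mean()),
                         "mae_p5": float(np.abs(err_p5[m]).mean())})
    return rows

rng = np.random.default_rng(5)
n = 300
err_p4 = rng.normal(0, 15, n)                  # P4 forecast error, bp
update = 0.8 * err_p4 + rng.normal(0, 4, n)    # news closes most of the gap
err_p5 = err_p4 - update                       # error left after the update
table = binned_mae(update, err_p4, err_p5, edges=[0, 5, 10, 20, np.inf])
```

Under this data-generating assumption the heaviest-update bin shows the largest error reduction, reproducing the monotone pattern qualitatively.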

#### Comparison of diagnostic properties.

As Table <a href="#tab:p4_vs_p5" data-reference-type="ref" data-reference="tab:p4_vs_p5">[tab:p4_vs_p5]</a> shows, incorporating inter-meeting news sharply improves the shock’s orthogonality diagnostics. The slope coefficient remains close to one for both priors ($`\hat\beta_{P4} = 0.976`$, $`\hat\beta_{P5} = 0.918`$, each significant at the 1% level), with $`\mathcal{P}_4`$ statistically indistinguishable from unity and $`\mathcal{P}_5`$ within one standard error of it. Predictability from the Bauer & Swanson (2023b) macro-financial factors declines from 10.9% to 4.1% — the joint $`F`$-test against the predictors switches from significant at the 5% level to insignificant — and serial correlation in own lags falls from 5.1% to 1.5%, also switching from significant to insignificant. Forward self-prediction is near zero under either prior, consistent with both measures behaving approximately as innovations on this margin. Together, these reductions bring the $`\mathcal{P}_5`$ measure substantially closer to a true innovation process.

<div class="minipage">

*Note:* Comparison of surprise measures computed from the document-only prior ($`\mathcal{P}_4`$) and the document-plus-news prior ($`\mathcal{P}_5`$). Measurement quality regresses the actual rate change on the surprise; $`\beta = 1`$ under forecast efficiency. Predictability regresses each surprise on the six Bauer & Swanson (2023b) macro-financial predictors. Serial correlation regresses each surprise on four own lags. Forward rate prediction regresses the next meeting’s rate change on the current surprise. Forward own-surprise regresses the next surprise on the current surprise; $`\beta = 0`$ for a true innovation. All regressions use Newey-West HAC standard errors (4 lags). $`F`$-test $`p`$-values in brackets for $`R^2`$ rows. $`^{***}`$, $`^{**}`$, $`^{*}`$: 1%, 5%, 10%.

</div>

Forward-rate prediction attenuates under $`\mathcal{P}_5`$: the $`\mathcal{P}_4`$ surprise predicts the next meeting’s rate change with $`\hat\beta = 0.333`$ ($`p = 0.019`$), while the $`\mathcal{P}_5`$ coefficient falls to $`0.184`$ and loses significance. This is consistent with $`\mathcal{P}_5`$ absorbing the part of the rate-path content that markets price in during the inter-meeting window (formalized as $`\Delta_t^{\mathrm{news}}`$ below), so that less of it remains in the meeting-eve residual; the inter-meeting revision then carries the forward-guidance signal more directly than the residual.

#### IRF macro-invariance with sharpened rate-path persistence.

Re-estimating the headline 2SLP-IV specification of Section <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a> with $`\mathcal{P}_5`$-based surprises in place of $`\mathcal{P}_4`$ delivers two complementary findings (Figure <a href="#fig:p4_vs_p5_irfs" data-reference-type="ref" data-reference="fig:p4_vs_p5_irfs">38</a>). First, the macroeconomic transmission is essentially unchanged: $`\mathcal{P}_5`$ point estimates lie within the $`\mathcal{P}_4`$ 68% bands at every horizon for real GDP, the PCE price index, industrial production, unemployment, and the 10-year yield, with horizon-by-horizon differences in second-stage coefficients of 0.01–0.07pp. The IV identification recovers the same announcement-bundle response per unit shock under either prior. Second, the federal funds rate IRF (the instrumented variable) is materially more persistent under $`\mathcal{P}_5`$: the rate stays $`+0.4`$pp elevated through month 24 versus mean-reversion to zero under $`\mathcal{P}_4`$. The mechanism is that the news stage upgrades meetings whose rate move genuinely signals a continued trajectory and downgrades isolated moves — so $`\mathcal{P}_5`$ shocks carry more rate-path information per unit, even though the macro pass-through per unit is the same.[^30]

<figure id="fig:p4_vs_p5_irfs" data-latex-placement="t!">
<figcaption><em>Note:</em> Impulse responses to a 25 bp narrative surprise, instrumenting the federal funds rate. Solid blue: <span class="math inline">𝒫<sub>4</sub></span> (document-only prior); dashed vermilion: <span class="math inline">𝒫<sub>5</sub></span> (document-plus-news prior). Shaded areas are pointwise 68% (inner) and 90% (outer) HAC confidence bands of the unsmoothed coefficients; solid/dashed lines are 3-period moving averages of the point estimates. ZLB excluded, COVID included.</figcaption>
</figure>

#### Timing decomposition: blackout-window revision vs meeting-eve residual.

The IV invariance hides a richer object. Using the timing of the pipeline directly, define two complementary shocks from the same documentary inputs:
``` math
\hat{s}_t^{P_5} = \Delta i_t - \mathbb{E}[\Delta i_t \mid \mathcal{P}_5], \qquad \Delta_t^{\mathrm{news}} = \mathbb{E}[\Delta i_t \mid \mathcal{P}_5] - \mathbb{E}[\Delta i_t \mid \mathcal{P}_4] = \hat{s}_t^{P_4} - \hat{s}_t^{P_5}.
```
The first is the announcement-day surprise: the residual relative to the full pre-meeting information set including blackout-window news. The second is the documentary expectation revision over the pre-FOMC blackout window $`[\mathcal{B}_t, \mathrm{meeting\ eve}]`$, when the Fed is officially silent and the news stage is processing data releases, market commentary, and press digestion of pre-blackout Fedspeak; the revision is therefore driven by *public news arriving during the blackout*, not by exogenous policy events. A joint OLS local projection, with both shocks entered together,
``` math
y_{t+h} = \alpha_h + \beta_h \, \hat{s}_t^{P_5} + \gamma_h \, \Delta_t^{\mathrm{news}} + \text{controls} + \varepsilon_{t+h},
```
separates the macro response per unit of each. Figures <a href="#fig:joint_lp_gamma" data-reference-type="ref" data-reference="fig:joint_lp_gamma">39</a> and <a href="#fig:joint_lp_decomp" data-reference-type="ref" data-reference="fig:joint_lp_decomp">40</a> report the two coefficient series. The two components have distinct signatures. The meeting-eve residual $`\hat{s}_t^{P_5}`$ delivers correctly-signed contractionary responses: real GDP falls to roughly $`-0.4\%`$ by month $`36`$ and the PCE price index falls to roughly $`-0.2\%`$, the textbook pattern of a monetary-policy shock identified from the announcement residual after pre-meeting public information has been absorbed. The blackout-window revision $`\Delta_t^{\mathrm{news}}`$ shows the opposite signature on activity: a hawkish revision is followed by real GDP rising to roughly $`+1\%`$, industrial production rising, and unemployment falling, while the PCE price index falls and the federal funds rate rises. The activity and inflation responses move in opposite directions in the way characteristic of an information shock — public news about stronger fundamentals is jointly priced into expected activity, expected tightening, and (eventually) lower inflation as the tighter policy bites.
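
The decomposition is exact by construction: the document-only surprise splits additively into the blackout-window revision and the meeting-eve residual, $`\hat{s}_t^{P_4} = \Delta_t^{\mathrm{news}} + \hat{s}_t^{P_5}`$. A numerical sketch with synthetic magnitudes loosely matching the standard deviations reported in the figure notes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
e_p4 = rng.normal(0.0, 10.0, n)            # document-only prior mean (bp)
d_news = rng.normal(0.0, 14.0, n)          # blackout-window revision
e_p5 = e_p4 + d_news                       # document-plus-news prior mean
actual = e_p5 + rng.normal(0.0, 13.0, n)   # realized decision

s_p4 = actual - e_p4                       # document-only surprise
s_p5 = actual - e_p5                       # meeting-eve residual
# s_p4 = d_news + s_p5 holds identically, meeting by meeting.
```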

<figure id="fig:joint_lp_gamma" data-latex-placement="t!">
<figcaption><em>Note:</em> Coefficients <span class="math inline"><em>γ</em><sub><em>h</em></sub></span> from the joint OLS local projection <span class="math inline"><em>y</em><sub><em>t</em> + <em>h</em></sub> = <em>α</em><sub><em>h</em></sub> + <em>β</em><sub><em>h</em></sub><em>ŝ</em><sub><em>t</em></sub><sup><em>P</em><sub>5</sub></sup> + <em>γ</em><sub><em>h</em></sub><em>Δ</em><sub><em>t</em></sub><sup>news</sup> + controls + <em>ε</em><sub><em>t</em> + <em>h</em></sub></span>, plotting only the <span class="math inline"><em>γ</em><sub><em>h</em></sub></span> series and normalizing the impulse to a <span class="math inline">25</span> bp hawkish revision in <span class="math inline"><em>Δ</em><sub><em>t</em></sub><sup>news</sup></span> (<span class="math inline"><em>σ</em> ≈ 14</span> bp on the meeting-month subsample). Shaded areas: <span class="math inline">68%</span> (inner) and <span class="math inline">90%</span> (outer) Newey-West HAC confidence bands with bandwidth <span class="math inline"><em>h</em> + 1</span>; solid line is a 3-period moving average of the point estimates. Macro controls: 4 lags of FFR, unemployment, log PCE, log IP, log SP500, EBP. ZLB excluded, COVID included.</figcaption>
</figure>

<figure id="fig:joint_lp_decomp" data-latex-placement="t!">
<figcaption><em>Note:</em> Side-by-side display of the two coefficients from the same joint LP, normalized to a <span class="math inline">1<em>σ</em></span> shock in each regressor (computed on the meeting-month subsample where the shocks are non-zero: <span class="math inline"><em>σ</em>(<em>ŝ</em><sub><em>t</em></sub><sup><em>P</em><sub>5</sub></sup>) ≈ 13</span> bp; <span class="math inline"><em>σ</em>(<em>Δ</em><sub><em>t</em></sub><sup>news</sup>) ≈ 14</span> bp). Left column: <span class="math inline"><em>β</em><sub><em>h</em></sub></span> on the meeting-eve residual <span class="math inline"><em>ŝ</em><sub><em>t</em></sub><sup><em>P</em><sub>5</sub></sup></span>, the announcement-day surprise relative to all pre-meeting public information (correctly-signed contractionary: GDP and PCE both fall). Right column: <span class="math inline"><em>γ</em><sub><em>h</em></sub></span> on the blackout-window expectation revision <span class="math inline"><em>Δ</em><sub><em>t</em></sub><sup>news</sup></span>, capturing how public news arriving during the pre-FOMC blackout moves expectations of the still-pending decision (information-shock pattern: GDP rises while PCE falls). Bands as in Figure <a href="#fig:joint_lp_gamma" data-reference-type="ref" data-reference="fig:joint_lp_gamma">39</a>.</figcaption>
</figure>

The economic reading is that the announcement-day residual $`\hat{s}_t^{P_5}`$ carries the cleaner monetary-policy-transmission signature, while the blackout-window revision $`\Delta_t^{\mathrm{news}}`$ is an information-shock object: a hawkish public-news revision arrives bundled with stronger-fundamentals news, so activity rises alongside expected tightening before inflation eventually cools. This recovers, from the pipeline’s documentary timing alone, an analogue of the Jarociński & Karadi (2020) information-vs-policy decomposition without requiring asset-price sign restrictions. The implication for the main-text IRFs is that $`\hat{s}_t^{P_4}`$ works well as an instrument because the contractionary signal lives mainly in its announcement-residual component; the blackout-window revision, taken alone, is not an appropriate object for instrumenting the federal funds rate in an MP-transmission study, though it remains an economically interesting measurement of how public news during the Fed’s silence moves expectations of the upcoming decision.

#### Revision behavior.

Of 178 post-2004 meetings with FactSet StreetAccount coverage, the news stage updates the prior in 122 (68.5%). Of those 122 revised meetings, updates are evenly split between amplifying (50%) and softening (50%) the $`\mathcal{P}_4`$ forecast, with direction reversals — where $`\mathcal{P}_5`$ moves opposite to $`\mathcal{P}_4`$’s sign — occurring in only 4.5% of all meetings. News therefore refines rather than contradicts the documentary pipeline. The LLM’s self-assessed signal strength is internally calibrated: meetings rated “negligible” produce zero revision, “moderate” an average $`|{\Delta}|`$ of 3.3 bp, and “strong” an average of 7.4 bp, a monotone relationship that holds without any ex post recalibration. The 35 largest revisions ($`|{\Delta}| > 15`$ bp) cluster visibly around identifiable economic turning points: the March 2020 COVID emergency cut ($`-39`$ bp), the December 2008 ZLB transition ($`-30`$ bp), the Lehman collapse ($`-26`$ bp), the December 2021 inflation-regime pivot ($`+20`$ bp), and the December 2015 liftoff ($`+21`$ bp).

<div class="textbox">

<div class="tcolorbox">

*Synthesizer — net signal assessment*

<div class="tcolorbox">

The market consensus for a 50 bp cut is overwhelming and justified. While recent data (e.g., +225K payrolls, ISM manufacturing expansion) and some Fed rhetoric suggest stability, the coronavirus shock has fundamentally altered the landscape. Financial conditions show systemic stress — yields plunging to record lows, oil crashing 16%, and corporate earnings downgrades — indicating markets are pricing in severe economic damage. Powell’s ‘act as appropriate’ is the only signal that matters — it implicitly endorses market expectations. Pre-coronavirus data is irrelevant now; the Fed will cut aggressively to preempt a recession and stabilize markets.

</div>

*Forecaster — $`\mathcal{P}_5`$ justification*

<div class="tcolorbox">

The coronavirus-driven deterioration in financial conditions and global growth outlook has forced a pivot to aggressive easing, making a 50 bp cut almost certain despite earlier Fed hesitancy and strong pre-virus data.

</div>

</div>

<div class="tcolorbox">

*Synthesizer — net signal assessment*

<div class="tcolorbox">

The consensus is unequivocal: a 50 bp hike in May is nearly certain, anchored by Powell’s explicit guidance and market pricing. Inflation persistence — with 10-year breakevens at two-decade highs — and the global hawkish pivot create a compelling narrative for aggressive action. The hawkish surprise in the Employment Cost Index ($`+1.4\%`$ vs. $`+1.1\%`$ consensus) is the standout, confirming wage-driven inflation pressures that the Fed cannot ignore. No domain meaningfully challenges the 50 bp hike expectation; risks are skewed toward even more aggressive future hikes if inflation does not abate.

</div>

*Forecaster — $`\mathcal{P}_5`$ justification*

<div class="tcolorbox">

Powell’s clear 50 bp hike signal, reinforced by strong wage inflation data and broad market alignment, overrides minor growth softness, making a 50 bp hike nearly certain.

</div>

</div>

*Note:* Both episodes illustrate the news stage identifying decisive information that arrived after the Beige Book but before the FOMC decision. In March 2020, the coronavirus shock rendered the documentary prior of $`-10`$ bp effectively obsolete; $`\mathcal{P}_5`$ moved to $`-49`$ bp, within 1 bp of the actual emergency cut. In May 2022, Powell’s inter-meeting signaling and the ECI wage surprise closed a 21 bp gap between $`\mathcal{P}_4`$ and the realized 50 bp hike.

</div>

Taken together, expanding the information set from $`\mathcal{B}_t`$ (Fed documents) to $`\mathcal{B}_t \cup \mathcal{N}_t`$ (documents plus news) narrows the gap between the document-only conditioning set and the broader public information set without materially weakening the core shock signal $`s_t`$. In the notation of Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a>, $`\mathcal{P}_5`$ moves the forecast from a document-conditioned prior closer to $`\mathcal{M}_t`$; it does not close the Fed-private-information gap in $`\mathcal{G}_t \setminus \mathcal{M}_t`$. The $`\mathcal{P}_5`$ construction thus addresses the same predictability concerns as Bauer & Swanson (2023b)-style ex post cleaning, but does so ex ante: it imposes no time-invariant coefficient assumption and leaves historical shocks unchanged as the sample grows. More generally, the remaining predictability need not be interpreted as a lower bound. The $`\mathcal{P}_4 \to \mathcal{P}_5`$ improvement itself illustrates this margin: simply conditioning on inter-meeting news cut Bauer & Swanson (2023b) predictability from 10.9% to 4.1% and serial correlation from 5.1% to 1.5%. Further gains from more accurate expectation extraction, whether through iterative elicitation, cross-model aggregation, or richer real-time information sets, remain available.

# Empirical Applications

This appendix collects the two empirical applications of the surprise measure: macroeconomic and financial transmission (Section <a href="#sec:appendix_impulse_responses" data-reference-type="ref" data-reference="sec:appendix_impulse_responses">11.1</a>) and a yield-curve trading strategy (Section <a href="#sec:appendix_trading" data-reference-type="ref" data-reference="sec:appendix_trading">11.2</a>).

## Impulse Responses

### First-Stage Diagnostics

The 2SLP-IV specification consists of two regressions at each horizon $`h`$, repeating the structure of equations (1)–(2) in Section <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a>. The first stage projects the policy variable on the narrative surprise:
``` math
\begin{equation}
P_{i,t} = \alpha + \pi \cdot \hat{s}_t + \sum_{\ell=1}^{4} \boldsymbol{\beta}_\ell' \mathbf{Z}_{t-\ell} + u_t,
\end{equation}
```
where $`P_{i,t}`$ is the instrumented policy variable (federal funds rate or two-year Treasury yield), $`\hat{s}_t`$ is the narrative surprise (instrument), and $`\mathbf{Z}_{t-\ell}`$ stacks four lags of the policy variable itself together with the macro control vector (unemployment, log industrial production, log PCE price index, S&P 500, excess bond premium). The second stage uses the fitted policy variable to estimate the impulse response,
``` math
\begin{equation}
y_{t+h} = \alpha_h + \beta_h \cdot \widehat{P_{i,t}} + \sum_{\ell=1}^{4} \boldsymbol{\gamma}_{h,\ell}' \mathbf{W}_{t-\ell} + \varepsilon_{t+h},
\end{equation}
```
with $`\mathbf{W}_{t-\ell}`$ the same macro control vector and four lags of the outcome $`y`$, and lags of the endogenous policy variable omitted (cf. Section <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a>). The first-stage $`F`$-statistic tests $`H_0: \pi = 0`$ from the HAC-robust $`t`$-statistic. Under heteroskedasticity and autocorrelation, the formally correct weak-instrument benchmark is the heteroskedasticity- and autocorrelation-robust effective $`F`$ of Montiel Olea & Pflueger (2013); the conventional rule of thumb $`F > 10`$ remains a useful coarse benchmark but is not derived under the inferential conditions used here.
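
The mechanics of the two stages can be sketched per horizon as follows. This minimal version strips out the lag structure, the macro control vector, and HAC inference, and runs on synthetic data; it fixes only the logic of the instrumented projection:

```python
import numpy as np

def ols(y, X):
    """OLS coefficient vector, intercept prepended to X."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

def two_stage_lp(y, P, s, Z_first, W_second, h):
    """2SLP-IV at horizon h: stage 1 fits the policy variable P on the
    surprise s plus first-stage controls; stage 2 projects y_{t+h} on the
    fitted policy variable plus second-stage controls."""
    n = len(y) - h
    X1 = np.column_stack([s[:n], Z_first[:n]])
    b1 = ols(P[:n], X1)
    P_hat = np.column_stack([np.ones(n), X1]) @ b1
    b2 = ols(y[h:], np.column_stack([P_hat, W_second[:n]]))
    return b2[1]                           # beta_h, the IRF at horizon h

rng = np.random.default_rng(1)
T, h = 5000, 3
s = rng.normal(size=T)                     # narrative surprise (instrument)
Z = rng.normal(size=(T, 1))                # stand-in control vector
P = s + 0.5 * Z[:, 0] + 0.1 * rng.normal(size=T)      # policy variable
y = np.zeros(T)
y[h:] = 0.5 * P[:-h] + 0.1 * rng.normal(size=T - h)   # true response is 0.5
beta_h = two_stage_lp(y, P, s, Z, Z, h)
```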

Table <a href="#tab:first_stage_diagnostics" data-reference-type="ref" data-reference="tab:first_stage_diagnostics">[tab:first_stage_diagnostics]</a> documents instrument strength across policy variables (federal funds rate and Treasury yields of varying maturities) and sample specifications (excluding versus including ZLB periods). The raw-impact column reports the un-normalized first-stage coefficient: a one-unit increase in the narrative surprise shifts the two-year Treasury yield by 50.75 basis points in the full sample. $`F`$-statistics comfortably exceed the rule-of-thumb benchmark of 10 across all specifications, including the 5-year yield in the ZLB-excluded sample where the smaller sample reduces power.

Instrument strength is stable rather than sample-dependent. For the two-year yield, the first-stage coefficient shifts by less than one percent between samples (51.09 vs. 50.75 bp); the higher $`F`$-statistics in the full sample reflect increased statistical power, not a more informative instrument. Across the 36 estimation horizons ($`h = 0, \ldots, 35`$), the first-stage point estimate is stable; the horizon-by-horizon variation in $`F`$ reflects the change in HAC bandwidth (set to $`h+1`$) and the small differences in the estimation sample induced by horizon-specific outcome availability rather than a structurally horizon-specific first stage.

### LP Specification Robustness

This subsection documents three robustness exercises for the headline 2SLP-IV specification: the Jordà & Taylor (2025) long-difference variant (the primary small-sample-robust check), the relationship between the IV scaling and a direct reduced-form local projection on the narrative surprise, and the role of lagged shocks as included exogenous controls in both stages.

#### Jordà & Taylor (2025) Long-Difference Specification.

Jordà & Taylor (2025) (§3) show that levels local projections inherit a small-sample bias of order $`O(T^{-1})`$ when the regressor is highly persistent, and recommend the long-difference specification $`y_{t+h} - y_{t-1} = \alpha + \beta_h \hat{s}_t + \gamma_h \, \Delta y_{t-1} + u_{t+h}`$ as the preferred small-sample-robust alternative because long differencing largely suppresses this bias even when $`|\rho| \to 1`$. The federal funds rate in the present LP sample has $`\hat{\rho} \approx 0.99`$, placing the headline levels specification squarely in the regime where this bias correction matters.
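
A simulation sketch of why the long-difference form is well behaved under near-unit-root persistence: with a random-walk outcome whose increments load on the shock, the regression recovers the impact coefficient (synthetic data, not the paper’s series):

```python
import numpy as np

def long_diff_lp(y, s, h):
    """Long-difference LP: regress y_{t+h} - y_{t-1} on the shock s_t and
    the lagged first difference of y (minimal sketch, no other controls)."""
    T = len(y)
    t = np.arange(2, T - h)
    lhs = y[t + h] - y[t - 1]
    dy_lag = y[t - 1] - y[t - 2]
    X = np.column_stack([np.ones(len(t)), s[t], dy_lag])
    beta, *_ = np.linalg.lstsq(X, lhs, rcond=None)
    return beta[1]                         # coefficient on the shock

rng = np.random.default_rng(4)
T, h = 5000, 6
s = rng.normal(size=T)
y = np.cumsum(0.3 * s + 0.1 * rng.normal(size=T))   # rho = 1 outcome
beta_h = long_diff_lp(y, s, h)             # recovers the 0.3 impact loading
```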

I therefore re-estimate the baseline 2SLP-IV in long differences, again instrumenting the federal funds rate with the narrative surprise. Figure <a href="#fig:jorda2025_robustness" data-reference-type="ref" data-reference="fig:jorda2025_robustness">41</a> overlays the long-difference IRFs against the levels headline across the same six outcomes shown in Figure <a href="#fig:main_irf_panel" data-reference-type="ref" data-reference="fig:main_irf_panel">9</a>. Point estimates are essentially identical: at $`h = 12`$ real GDP responds $`-2.04`$pp under the Jordà specification versus $`-2.08`$pp under levels, PCE $`-0.82`$pp versus $`-0.80`$pp, industrial production $`-0.70`$pp versus $`-0.76`$pp, unemployment $`+0.21`$pp versus $`+0.17`$pp, and the 10-year yield is within 0.04pp. Sign and timing of peaks and troughs match across all six outcomes, and confidence bands are comparable. The headline narrative survives the small-sample-robust specification with virtually no change in magnitudes, providing direct evidence that the levels-LP small-sample bias documented by Jordà & Taylor (2025) is not driving the headline IRFs in this sample.

<figure id="fig:jorda2025_robustness" data-latex-placement="t!">
<figcaption><em>Note:</em> Impulse response functions to a 25 bp narrative surprise. Blue: levels 2SLP-IV (headline, Figure <a href="#fig:main_irf_panel" data-reference-type="ref" data-reference="fig:main_irf_panel">9</a>). Red: <span class="citation" data-cites="Jorda2025">Jordà &amp; Taylor (2025)</span> long-difference 2SLP-IV with the lagged first difference as the persistent control. Both specifications instrument the federal funds rate with the narrative surprise; Newey-West HAC bandwidth <span class="math inline"><em>h</em> + 1</span>; ZLB excluded, COVID included. Sample: 171 meeting-month observations (1996–2025). Shaded areas: 68% (inner) and 90% (outer) pointwise HAC confidence bands.</figcaption>
</figure>

#### Reduced-Form vs IV Scaling.

A direct reduced-form local projection regresses the outcome on the narrative surprise itself,
``` math
\begin{equation}
    y_{t+h} = a_h + \theta_h \, \hat{s}_t + \mathrm{controls}_t + v_{t+h},
    \label{eq:rf_lp}
\end{equation}
```
delivering an IRF in shock units. Under just-identified IV with symmetric exogenous controls in the first and second stages, the rescaling identity $`\beta_h = \theta_h / \pi_h`$ holds exactly, where $`\pi_h`$ is the first-stage coefficient on $`\hat{s}_t`$. The headline specification breaks symmetry by retaining federal funds rate lags in the first stage but not the second; for the small-sample-bias reasons set out in Section <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a> and documented numerically below, including FFR lags symmetrically in both stages of a levels LP-IV is over-controlled in this sample. Figure <a href="#fig:reduced_form_vs_iv" data-reference-type="ref" data-reference="fig:reduced_form_vs_iv">42</a> overlays $`\beta_h`$ from the headline 2SLP-IV against the rescaled $`\theta_h / \hat{\pi}_h`$, with $`\hat{\pi}_h`$ taken from the first stage at horizon $`h`$.
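
The rescaling identity is easy to verify numerically: with the same exogenous controls in the first stage, the reduced form, and the second stage, just-identified 2SLS reproduces $`\theta_h / \pi_h`$ exactly in any finite sample. A sketch on synthetic data:

```python
import numpy as np

def ols(y, X):
    """OLS coefficient vector, intercept prepended to X."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

rng = np.random.default_rng(2)
T = 400
s = rng.normal(size=T)                     # instrument (narrative surprise)
Z = rng.normal(size=(T, 2))                # controls, identical in all stages
P = 0.8 * s + Z @ np.array([0.3, -0.2]) + rng.normal(size=T)  # policy var
y = 0.5 * P + Z @ np.array([0.1, 0.4]) + rng.normal(size=T)   # outcome

b1 = ols(P, np.column_stack([s, Z]))                # first stage
pi_hat = b1[1]
theta_hat = ols(y, np.column_stack([s, Z]))[1]      # reduced form
P_hat = np.column_stack([np.ones(T), s, Z]) @ b1
beta_iv = ols(y, np.column_stack([P_hat, Z]))[1]    # 2SLS second stage
# With symmetric controls, beta_iv = theta_hat / pi_hat holds exactly.
```

Breaking the symmetry, as the headline specification does by dropping FFR lags from the second stage, is precisely what opens the numerical gap discussed below the figure.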

<figure id="fig:reduced_form_vs_iv" data-latex-placement="t!">
<figcaption><em>Note:</em> Blue: 2SLP-IV impulse responses <span class="math inline"><em>β</em><sub><em>h</em></sub></span> to a 25 bp narrative surprise, as in Figure <a href="#fig:main_irf_panel" data-reference-type="ref" data-reference="fig:main_irf_panel">9</a>. Green dashed: reduced-form coefficient <span class="math inline"><em>θ</em><sub><em>h</em></sub></span> from <a href="#eq:rf_lp" data-reference-type="eqref" data-reference="eq:rf_lp">[eq:rf_lp]</a>, rescaled by the first-stage coefficient <span class="math inline"><em>π̂</em><sub><em>h</em></sub></span>. The equivalence <span class="math inline"><em>β</em><sub><em>h</em></sub> = <em>θ</em><sub><em>h</em></sub>/<em>π</em><sub><em>h</em></sub></span> is exact for <span class="math inline"><em>y</em><sub><em>t</em> + <em>h</em></sub> = FFR<sub><em>t</em> + <em>h</em></sub></span> and qualitative otherwise: the second-stage exclusion of FFR lags breaks control symmetry between the two stages, and treating <span class="math inline">$\widehat{\mathrm{FFR}}_t$</span> as observed in the second stage adds generated-regressor noise that the rescaling does not undo. Shapes track each other (sign and timing of peaks); levels diverge. Bands: 68% (inner) and 90% (outer) pointwise HAC. Sample: 171 meeting-month observations, 1996–2025, ZLB excluded.</figcaption>
</figure>

The qualitative shape of the response matches across all six outcomes: the disinflation in PCE and industrial production, the gradual rise in unemployment, the initial flattening and subsequent steepening of the 10Y–3M spread, and the muted long-yield response are all present in both series. The numerical levels differ for two structural reasons. First, the IV second stage drops federal funds rate lags while the reduced-form retains them through the macro control set, breaking the symmetric-controls condition the algebraic equivalence requires. Second, the IV second stage treats $`\widehat{\mathrm{FFR}}_t`$ as an observed regressor, leaving first-stage estimation noise in the IV residual that no rescaling of $`\theta_h`$ by $`\hat{\pi}_h`$ can replicate; Montiel Olea & Plagborg-Møller (2021) provide the formally correct LP-IV inference for the symmetric case. The IV scaling is reported as the headline specification for direct comparability with rate-shock IRFs in the literature (Gertler & Karadi, 2015; Aruoba & Drechsel, 2024); Figure <a href="#fig:reduced_form_vs_iv" data-reference-type="ref" data-reference="fig:reduced_form_vs_iv">42</a> confirms that the qualitative conclusions do not depend on the IV step. Combined with the Jordà long-difference replication of Figure <a href="#fig:jorda2025_robustness" data-reference-type="ref" data-reference="fig:jorda2025_robustness">41</a>, this leaves the headline transmission story standing on three independent specifications.

#### Shock Lags.

The baseline specification omits lags of $`\hat{s}_t`$ from both stages, following the Aruoba & Drechsel (2024) and Romer & Romer (2004) narrative-instrument convention. Stock & Watson (2018) and Plagborg-Møller & Wolf (2021) note that, under the identification assumption already imposed, instrument lags are not required for consistency but can absorb residual serial correlation in $`\varepsilon_{t+h}`$ at long horizons. Figure <a href="#fig:shock_lag_robustness" data-reference-type="ref" data-reference="fig:shock_lag_robustness">43</a> compares impulse responses with $`\hat{s}_t`$ lags set to zero (baseline) versus two lags entered as included exogenous (predetermined) controls in both stages, with the contemporaneous shock retained as the excluded instrument.

<figure id="fig:shock_lag_robustness" data-latex-placement="t!">
<p><br />
<br />
</p>
<figcaption><em>Note:</em> 2SLP-IV impulse responses to a 25 bp narrative surprise, with <span class="math inline"><em>ŝ</em><sub><em>t</em></sub></span> lags set to zero (blue) versus two lags as included exogenous controls in both stages (red). Specification otherwise as in Figure <a href="#fig:main_irf_panel" data-reference-type="ref" data-reference="fig:main_irf_panel">9</a>: 4 lags, Newey-West HAC (bandwidth <span class="math inline"><em>h</em> + 1</span>), 68% (inner) and 90% (outer) pointwise HAC confidence bands, ZLB excluded, COVID included.</figcaption>
</figure>

Point estimates track each other closely across all six outcomes, with the largest gap of 0.33 pp on the federal funds rate (range approximately $`\pm 1.8`$ pp, so under 20% of the IRF range) and at most 0.08 pp on every other outcome. Sign, timing of peaks and troughs, and qualitative conclusions are identical across specifications. The headline IRFs are therefore robust to the lag-augmentation efficiency adjustment.

### Zero Lower Bound Robustness

The headline IRFs exclude the conventional ZLB period (Dec 2008–Dec 2015) because the federal funds rate cannot move when constrained near zero, blunting an instrument that targets that move. This subsection asks how much of the result is sample-driven by re-estimating two variants on the full 1996–2025 window: the headline FFR-instrumented specification (which simply ignores the ZLB constraint) and a Gertler–Karadi-style specification that instruments the two-year Treasury yield rather than the policy rate. Following Gertler & Karadi (2015), the two-year yield responds to forward guidance signals even when the FFR is constrained, making it a more suitable policy indicator during unconventional regimes. An alternative approach substitutes the Wu–Xia shadow rate into the R&R Greenbook regression (Bügel et al., 2026); I instead maintain the observed policy rate while instrumenting the two-year yield, which avoids dependence on a shadow-rate model and preserves real-time implementability.

<figure id="fig:zlb_comparison_ffr" data-latex-placement="t!">
<p><br />
<br />
</p>
<figcaption><em>Note:</em> Impulse response functions to a 25 bp contractionary monetary policy surprise on the full sample including the ZLB period (Dec 2008–Dec 2015). Instrument: federal funds rate. Two-stage local projections (2SLP-IV), 4 lags, 0 shock lags. Controls: unemployment, log PCE Price Index, log industrial production, S&amp;P 500, EBP. Newey-West HAC standard errors, bandwidth <span class="math inline"><em>h</em> + 1</span>. Shaded areas: 68% (inner) and 90% (outer) pointwise HAC confidence bands. Sample: 248 meeting-month observations (1996–2025). Horizons in months.</figcaption>
</figure>

<figure id="fig:zlb_comparison_2y" data-latex-placement="t!">
<p><br />
<br />
</p>
<figcaption><em>Note:</em> As Figure <a href="#fig:zlb_comparison_ffr" data-reference-type="ref" data-reference="fig:zlb_comparison_ffr">44</a>, but instrumenting the two-year Treasury yield rather than the federal funds rate, following <span class="citation" data-cites="GertlerKaradi2015">Gertler &amp; Karadi (2015)</span>. Instrumenting the two-year yield preserves identification when the policy rate is constrained near zero, since forward guidance shifts the two-year yield even during ZLB episodes. Sample: 248 meeting-month observations.</figcaption>
</figure>

The 2-year-yield specification produces stable estimates with strong first-stage identification (Table <a href="#tab:first_stage_diagnostics" data-reference-type="ref" data-reference="tab:first_stage_diagnostics">[tab:first_stage_diagnostics]</a>): $`F = 47.53`$ in the full sample versus $`F = 28.41`$ in the ZLB-excluded sample, both well above the rule-of-thumb threshold of 10. The first-stage coefficient on the 2-year yield shifts by less than one percent between samples (51.09 vs. 50.75 bp), confirming that the narrative surprise predicts the yield equally well during unconventional policy. The contrast with residual-based measures is sharp: when the FFR cannot move, regression-residual shocks inflate mechanically, producing volatile residuals even when actual policy is stable. The narrative measure avoids this degradation because it is extracted from documents rather than backed out from regression residuals.

Across both specifications, contractionary surprises generate the expected directions: real GDP declines, industrial production contracts, the PCE price index falls, unemployment rises, and term spreads compress on impact and steepen at longer horizons. The qualitative transmission patterns remain consistent with the ZLB-excluded headline; including ZLB observations shifts magnitudes modestly but does not flip signs or alter the timing of peak responses.

### Bauer–Swanson Purging Robustness

Bauer & Swanson (2023b) show that high-frequency monetary policy surprises are significantly predicted by pre-meeting macroeconomic and financial variables, raising concerns that apparent “shocks” may partly reflect forecastable policy actions. Section <a href="#sec:predictability" data-reference-type="ref" data-reference="sec:predictability">5.2</a> documents moderate predictability ($`R^2 = 0.166`$) of my narrative surprise from the same predictor set on the full predictability sample. If this predictable component drives the impulse response results, purging it should materially alter the estimated responses.

I residualize the narrative surprise on the six Bauer & Swanson (2023b) predictors and re-estimate the baseline 2SLP-IV specification using only the unpredictable component. On the IRF subsample ($`N = 153`$), the predictor set explains 15.0% of surprise variance. Figure <a href="#fig:purged_vs_raw" data-reference-type="ref" data-reference="fig:purged_vs_raw">46</a> compares impulse responses using the raw and purged surprises. Point estimates shift modestly across outcomes (typical $`\Delta\beta`$ of $`0.1`$–$`0.3`$pp), but signs and the qualitative shape of the IRFs are preserved: the price index still falls, output and industrial production still contract, the term-structure response remains hump-shaped. The predictable component of the narrative surprise does not drive the impulse-response results.
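
The purging step is a plain residualization. A minimal sketch with stand-in data (the predictor matrix and surprise series here are simulated, not the actual Bauer–Swanson regressors):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 153, 6   # IRF subsample size, number of pre-meeting predictors

X = rng.normal(size=(n, k))                                  # stand-in predictors
s = X @ rng.normal(scale=0.2, size=k) + rng.normal(size=n)   # stand-in surprise

# Residualize s on a constant plus the predictors; keep the unpredictable part.
Z = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Z, s, rcond=None)
s_purged = s - Z @ coef

# The purged surprise is orthogonal to every predictor by construction.
assert np.allclose(Z.T @ s_purged, 0.0, atol=1e-8)

# Share of surprise variance the predictors explain
# (the paper reports 15.0% on the N = 153 subsample).
r2 = 1.0 - s_purged.var() / s.var()
print(f"purging R^2 = {r2:.3f}")
```

The purged series then replaces the raw surprise as the excluded instrument in the same 2SLP-IV specification; nothing else in the estimation changes.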

<figure id="fig:purged_vs_raw" data-latex-placement="t!">
<p><br />
<br />
</p>
<figcaption><em>Note:</em> Impulse response functions to a 25 bp monetary policy surprise. Blue: baseline narrative surprise. Red: narrative surprise residualized on six <span class="citation" data-cites="BauerSwanson2023a">Bauer &amp; Swanson (2023b)</span> predictors (purging <span class="math inline"><em>R</em><sup>2</sup> = 0.150</span>, <span class="math inline"><em>N</em> = 153</span>). 2SLP-IV instrumenting the federal funds rate, 4 lags, Newey-West HAC (bandwidth <span class="math inline"><em>h</em> + 1</span>). Shaded areas: 68% (inner) and 90% (outer) pointwise HAC confidence bands. ZLB excluded.</figcaption>
</figure>

### Pre-Crisis Comparisons with Established Benchmarks

I compare the narrative surprise with established measures on the common sample 1996:01–2008:10 (154 months), restricted by LLM availability. This imposes a cost on the benchmarks: Aruoba & Drechsel (2024)’s shock was estimated on 1984–2016 and Romer & Romer (2004)’s on 1969–1996; restricting both to 154 months of the Great Moderation understates their identification power. These comparisons should therefore be read as a check of directional consistency, not a horse race.

For each shock measure $`s^j_t`$, I estimate OLS local projections
``` math
\begin{equation}
\label{eq:comparison-lp}
    y_{t+h} = \alpha_h + \beta_h \, s^j_t + \sum_{\ell=1}^{2} \boldsymbol{\gamma}'_\ell \, \mathbf{X}_{t-\ell} + \boldsymbol{\delta}' \, \mathbf{X}_t + \varepsilon_{t+h},
\end{equation}
```
where $`\mathbf{X} = (\mathrm{IP}, \mathrm{CPI}, \mathrm{UE}, \mathrm{EBP}, \mathrm{SP500}, \mathrm{FFR}, \mathrm{VIX})`$ and standard errors use Newey-West HAC with bandwidth $`h+1`$. Two specification choices reflect the short sample. First, lag length is reduced from the four used by Aruoba & Drechsel (2024) to two, halving the parameter count relative to 154 observations. Second, responses are normalised to a one-standard-deviation shock of each measure rather than to a fixed FFR impact; a one-SD shock corresponds to approximately 10 basis points of FFR variation for all series, providing an economically grounded common scale that avoids the Wald-ratio instability inherent in normalising by a weak first stage.[^31]
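
The HAC inference used throughout can be reproduced with a short Newey–West routine. The sketch below is a generic Bartlett-kernel implementation on simulated data, not the paper's code; the bandwidth argument mirrors the $`h+1`$ convention in the text:

```python
import numpy as np

def newey_west_se(X, y, bandwidth):
    """OLS coefficients and Newey-West (Bartlett-kernel) HAC standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    u = y - X @ beta
    g = X * u[:, None]                 # score contributions x_t * u_t
    S = g.T @ g                        # lag-0 term (White/HC0 core)
    for lag in range(1, bandwidth + 1):
        w = 1.0 - lag / (bandwidth + 1.0)   # Bartlett weight
        gamma = g[lag:].T @ g[:-lag]
        S += w * (gamma + gamma.T)
    V = XtX_inv @ S @ XtX_inv
    return beta, np.sqrt(np.diag(V))

rng = np.random.default_rng(2)
n = 154                                    # common-sample length
s = rng.normal(size=n)                     # one-SD-scaled shock measure
e = rng.normal(size=n)
e[1:] += 0.5 * e[:-1]                      # serially correlated LP errors
y = 0.3 * s + e
X = np.column_stack([np.ones(n), s])

h = 12
beta, se = newey_west_se(X, y, bandwidth=h + 1)
print(beta[1], se[1])
```

At `bandwidth=0` the routine collapses to the White heteroskedasticity-robust variance, which is a convenient internal check.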

#### Text-Based Narrative Methods.

Figure <a href="#fig:comparison_narrative_methods" data-reference-type="ref" data-reference="fig:comparison_narrative_methods">47</a> compares my measure with the ridge-regularised textual analysis of Aruoba & Drechsel (2024) and the Greenbook-residualised shocks of Romer & Romer (2004). The directions of response are consistent across measures: industrial production contracts, unemployment rises, and the CPI shows the standard mild positive hump at intermediate horizons that is common in short, low-inflation samples. Magnitudes for the LLM measure are comparable to or slightly larger than the benchmarks on this 154-month window; the A&D and R&R responses for real variables are muted and statistically weaker, consistent with the loss of identifying power when their longer estimation samples are truncated to the Great Moderation.

<figure id="fig:comparison_narrative_methods" data-latex-placement="t!">
<p><br />
<br />
</p>
<figcaption><em>Note:</em> Impulse response functions to a one-standard-deviation shock of each measure on the <em>common</em> sample 1996:01–2008:10, restricted by LLM availability. A one-SD shock corresponds to approximately 10 bp of FFR variation for all three series. A&amp;D Ridge was designed for 1984–2016 and R&amp;R OLS for 1969–1996; both are evaluated here on an unrepresentative 154-month subsample, so comparisons indicate directional consistency rather than relative power. OLS local projections, 2 lags of log-level controls (IP, CPI, UE, EBP, SP500, FFR, VIX), recursive identification, 0 shock lags. The 2001-09-17 emergency intermeeting cut is excluded. Newey-West HAC, bandwidth <span class="math inline"><em>h</em> + 1</span>. Shaded areas: 68% (inner) and 90% (outer) pointwise HAC confidence bands.</figcaption>
</figure>

#### Cleaned Market-Based Surprises.

Figure <a href="#fig:comparison_cleaned_surprises" data-reference-type="ref" data-reference="fig:comparison_cleaned_surprises">48</a> compares my measure with *ex post* cleaned market-based approaches: Bauer & Swanson (2023a)’s orthogonalized high-frequency surprises (BS) and Miranda-Agrippino & Ricco (2021)’s monetary policy indicator (MAR). BS produces muted responses with wider confidence bands on this subsample, indicating weaker identification power on the truncated window. MAR displays wide uncertainty bands for unemployment and the S&P 500, the expected consequence of translating their VAR-estimated shocks into this LP specification. Both comparisons are consistent with the directional reading that direct narrative extraction generates contractionary responses without the additional *ex post* cleaning step those measures require, while remaining real-time implementable.

<figure id="fig:comparison_cleaned_surprises" data-latex-placement="t!">
<p><br />
<br />
</p>
<figcaption><em>Note:</em> Impulse response functions to a one-standard-deviation shock of each measure on the <em>common</em> sample 1996:01–2008:10. BS <span class="citation" data-cites="BauerSwanson2023b">(Bauer &amp; Swanson, 2023a)</span> and MAR <span class="citation" data-cites="MirandaRicco2021">(Miranda-Agrippino &amp; Ricco, 2021)</span> cover longer periods but are restricted here to the LLM-available window; comparisons indicate directional consistency. Specification identical to Figure <a href="#fig:comparison_narrative_methods" data-reference-type="ref" data-reference="fig:comparison_narrative_methods">47</a>.</figcaption>
</figure>

### GSS Target/Path Factor Decomposition

Gürkaynak et al. (2005) decompose FOMC announcement effects into a target factor (the current rate surprise) and a path factor (revisions to expected future policy), constructed via PCA rotation of FF4 and ED1–ED4 surprises. I regress the narrative surprise on these factors to characterize its position in the high-frequency factor space.

Table <a href="#tab:gss_decomposition" data-reference-type="ref" data-reference="tab:gss_decomposition">[tab:gss_decomposition]</a> reports the results. The narrative surprise loads significantly on the target factor ($`\hat{\beta}_T = 0.047`$, significant at the 1 percent level) but is orthogonal to the path factor ($`\hat{\beta}_P \approx 0`$, statistically indistinguishable from zero). Adding path to the target-only regression leaves $`R^2`$ essentially unchanged (12.4% versus 12.3%). Columns 4–5 show that the narrative surprise also covaries significantly with the two raw Eurodollar surprises that underlie the path factor: ED1 (the near-term rate surprise) and ED4 (the four-quarter-ahead surprise) explain 11.7% and 6.1% of narrative-surprise variance respectively, both with $`p < 0.01`$. The pattern — strong loading on the short end and on the orthogonalized target factor, near-zero loading on the orthogonalized path factor — is consistent with the narrative surprise tracking current-decision content rather than the path-factor revisions to expected future policy.

The target loading is expected: the narrative surprise is defined as the actual policy decision minus $`\mathbb{E}[\Delta r \mid \mathcal{P}_4]`$, so it shares variation with the current-meeting rate surprise to the extent that markets and the documentary pipeline track the same realized decision. The path orthogonality, combined with the forward prediction of future rate changes documented in Section <a href="#sec:identification" data-reference-type="ref" data-reference="sec:identification">5.4</a>, indicates that the narrative surprise captures a persistent policy-stance signal through a channel distinct from the one GSS path factors measure. Roughly 88% of narrative surprise variance lies outside the span of the two high-frequency factors, confirming that the measure is not a noisy text proxy for derivatives-based shocks but captures communication content that announcement-window pricing misses.

### J&K Decomposition: Sample-Sensitivity Robustness

The headline J&K decomposition table in the main text (Table <a href="#tab:unified_jk_validation" data-reference-type="ref" data-reference="tab:unified_jk_validation">[tab:unified_jk_validation]</a>) compares each surprise measure on its own native sample, which is convenient but partial. The three native samples differ at the endpoints: Miranda-Agrippino & Ricco (2021) ends in 2009 by construction, while Bauer & Swanson (2023b) and the LLM measure both run through the post-crisis period. Three robustness checks isolate what the headline asymmetry actually reflects.

#### Pairwise common sample (Panel A).

Dropping the M-A&R restriction and comparing B&S and the LLM measure on their joint sample 1996–2023 ($`N = 220`$), the LLM advantage on the joint Wald test survives: B&S rejects $`H_0\!: \beta_{\text{MP}} = 1,\, \beta_{\text{CBI}} = 0`$ at the 0.1% level, the LLM fails to reject at any conventional level. Point estimates on $`\beta_{\text{MP}}`$ are similar (0.749 vs. 0.840) and $`\beta_{\text{CBI}}`$ point estimates are larger for the LLM than for B&S (0.904 vs. 0.593); the gap on the joint test reflects both the LLM’s closer point estimate to one on $`\beta_{\text{MP}}`$ and its substantially wider standard errors. The non-rejection should therefore be read as the LLM measure failing to be falsified at the J&K benchmark on this longer sample, not as positive evidence of a clean MP shock.

#### Pre/post-crisis split (Panel B).

Splitting the LLM-native sample at the 2008 financial crisis shows that the headline asymmetry is concentrated in the post-crisis period. On 1996–2007 ($`N = 99`$), the LLM regression returns $`\hat{\beta}_{\text{MP}} = 0.467`$ and $`\hat{\beta}_{\text{CBI}} = 0.982`$, with the joint Wald test rejecting at the 5% level; pre-crisis, the LLM is statistically as contaminated by CBI content and as attenuated on MP loading as the announcement-window measures it is being compared against. On 2008–2024 ($`N = 122`$), the picture inverts: $`\hat{\beta}_{\text{MP}} = 1.787`$, $`\hat{\beta}_{\text{CBI}} = 0.549`$, with the joint Wald test failing to reject at any conventional level. The post-crisis $`\hat{\beta}_{\text{MP}} > 1`$ is consistent with the LLM measure picking up policy content that the J&K MP component itself partially misses at the ZLB, where the announcement-window prices the J&K decomposition is built on are mechanically constrained.

#### Pre/post press-conference split (Panel C).

The 2008 cut conflates two changes: the policy regime (zero lower bound, forward guidance, balance-sheet operations) and the LLM’s own information set (FOMC post-meeting press conferences, introduced in April 2011 and processed by the pipeline’s $`\mathcal{P}_2`$ stage). Splitting at the press-conference date isolates the second channel. Before April 2011 ($`N = 125`$), $`\hat{\beta}_{\text{MP}} = 0.657`$ and $`\hat{\beta}_{\text{CBI}} = 1.030`$, with the joint Wald test failing to reject at conventional levels — directionally similar to the pre-crisis result. From April 2011 onwards ($`N = 96`$), $`\hat{\beta}_{\text{MP}} = 1.919`$ but $`\hat{\beta}_{\text{CBI}}`$ collapses to $`-0.209`$ and the joint Wald test is far from rejecting. The CBI loading drops by roughly an order of magnitude and reverses sign once the press-conference transcript enters the documentary information set. A finer 3-way split (not tabulated; available on request) shows that the intermediate window 2008–03/2011 (ZLB without press conferences, $`N = 26`$) leaves $`\hat{\beta}_{\text{CBI}} = 0.933`$ and the joint test marginally significant at the 10% level. The ZLB regime alone is therefore not sufficient to clean the LLM measure against the J&K benchmark; the cleanest separation arrives once press-conference Q&A enters the pipeline.

The two panels point in the same direction but to slightly different mechanisms. Panel B identifies the regime channel (forward guidance and balance-sheet content become the dominant policy news once the funds-rate target is constrained); Panel C identifies the information-set channel (Q&A explicitly distinguishes economic-outlook revisions from policy-stance commitments, allowing the documentary pipeline to separate what announcement-window prices entangle). Both channels coincide in the data and cannot be fully separated on this sample, but the press-conference cut is somewhat sharper, consistent with the LLM measure benefiting most from communication content that did not exist before 2011. Pre-crisis, when the funds-rate announcement was the policy news and there was no Q&A document to read, the LLM measure offers no informational edge over a properly cleaned high-frequency surprise. The headline table understates this regime-dependence; reporting all three panels makes it explicit.

## Trading Strategy Robustness

The primary conclusion of this appendix is that the result reported in the main text is modest but not a fragile artifact of one weighting, one threshold, one inference method, or one set of event dates; it is front-end-specific and event-timed. The exercises that follow are computed on the same baseline specification (1m/2y equal-notional flattener, top-tercile signal, $`180`$-day hold) on the v30.5 deepseek-v3.1 sample with $`N = 70`$ events, and are presented in the order an attentive referee would raise them: first inference (is the result statistically real?), then timing (is the result look-ahead or single-announcement mispricing?), then design choices (are the threshold, the weighting, and the maturity pair the result of ex-post selection?), and finally extraction (is the result specific to one LLM?). All inference uses the stationary block bootstrap of Politis & Romano (1994) with $`5{,}000`$ resamples and mean block length $`L = 4`$ events, consistent with Table <a href="#tab:factor_decomposition" data-reference-type="ref" data-reference="tab:factor_decomposition">[tab:factor_decomposition]</a> and across the seven robustness tables.
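
The inference engine used throughout these tables is the stationary bootstrap in the sense of Politis & Romano (1994): resampled indices continue the current block with probability $`1 - 1/L`$ and jump to a random start with probability $`1/L`$, so block lengths are geometric with mean $`L`$ and the series wraps circularly. A minimal sketch on stand-in event returns (simulated, not the actual trade P&L):

```python
import numpy as np

def stationary_bootstrap_pvalue(returns, n_boot=5000, mean_block=4, seed=0):
    """Two-sided p-value for H0: mean return = 0, via the Politis-Romano
    stationary bootstrap (geometric block lengths, circular wrapping)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(returns, float)
    n = len(x)
    demeaned = x - x.mean()            # impose H0 in the bootstrap world
    p_new = 1.0 / mean_block
    means = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.empty(n, dtype=int)
        idx[0] = rng.integers(n)
        for t in range(1, n):
            if rng.random() < p_new:
                idx[t] = rng.integers(n)        # start a new block
            else:
                idx[t] = (idx[t - 1] + 1) % n   # continue block, wrap around
        means[b] = demeaned[idx].mean()
    return float(np.mean(np.abs(means) >= abs(x.mean())))

rng = np.random.default_rng(3)
r = rng.normal(loc=0.15, scale=0.6, size=70)   # stand-in for 70 event returns
print(stationary_bootstrap_pvalue(r, n_boot=1000))
```

With `mean_block=4` this reproduces the dependence structure assumed in the tables: consecutive 180-day holds overlap roughly four events, so blocks of that average length preserve the within-overlap correlation in each resample.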

### Overlap-corrected inference

The baseline strategy holds each position for $`180`$ calendar days, so consecutive holds overlap considerably in calendar exposure. The naive $`t`$-statistic on per-trade returns ignores this dependence and may understate standard errors. Table <a href="#tab:overlap_inference" data-reference-type="ref" data-reference="tab:overlap_inference">[tab:overlap_inference]</a> reports three overlap-aware corrections.

The naive iid $`t`$-test gives $`p = 0.031`$. A stationary block bootstrap of event-level returns (Politis & Romano, 1994) at block lengths spanning the four-event overlap horizon gives $`p`$-values in the range $`[0.025, 0.046]`$, all below the $`5\%`$ threshold. A Newey–West HAC standard error on the daily aggregated portfolio with bandwidth equal to the holding horizon yields $`p = 0.040`$. A simple effective-sample-size adjustment based on the event-level autocorrelation function (first four autocorrelations $`-0.13, +0.17, +0.22, +0.01`$, summing to $`0.27`$) gives $`n_{\text{eff}} \approx 46`$ and $`p = 0.082`$. The baseline result is significant at the $`5\%`$ level under naive, block-bootstrap, and HAC inference, and at the $`10\%`$ level under the effective-$`N`$ correction. The naive inference is not materially misleading because the event-level autocorrelation is partly negative at lag one (consecutive events often flip sign), which limits the variance inflation that overlap typically produces.
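
The effective-sample-size arithmetic can be checked directly, assuming the standard truncated variance-inflation factor $`n_{\text{eff}} = n / (1 + 2\sum_k \rho_k)`$ for the mean of a serially correlated series (the text does not spell out its exact kernel, so this is a reconstruction):

```python
import math

n = 70                              # event-level sample size
rho = [-0.13, 0.17, 0.22, 0.01]     # first four event-level autocorrelations

# Variance-inflation factor for a sample mean, truncated at lag 4.
inflation = 1.0 + 2.0 * sum(rho)    # = 1.54
n_eff = n / inflation               # = 45.45..., i.e. n_eff of about 46
print(inflation, n_eff)
assert math.ceil(n_eff) == 46
```

The partly negative lag-one autocorrelation is what keeps the inflation factor modest (1.54 rather than the 2–3 typical of heavily overlapping holds), which is why the naive and corrected verdicts stay close.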

### Cumulative-return horizon sweep and paired LLM–ED4 difference

Figure <a href="#fig:horizon_sweep_cum" data-reference-type="ref" data-reference="fig:horizon_sweep_cum">12</a> traces cumulative returns by holding horizon for the LLM signal alongside the ED4 and ED1 announcement-window benchmarks. Both the LLM and ED4 signals deliver comparable total returns at long horizons, peaking near 14–16% around 18–22 months; the shorter-end ED1 signal plateaus near 7%. The cross-horizon difference on point estimates is in the speed at which returns accumulate: at three months the LLM-based portfolio earns roughly $`8\%`$ while ED4 earns $`2\%`$; at six months the gap is $`9\%`$ versus $`3\%`$, converging only after twelve months. Both signals are observed on the day after meeting $`t`$, so any difference reflects cross-sectional directional content rather than the timing of information arrival: the LLM-signed flattener accumulates returns earlier in the holding period, while the ED4-signed flattener converges only as subsequent meetings realise the policy path.

The cumulative-return figure is descriptive: it plots only point estimates and the visual gap is not formally tested. Table <a href="#tab:paired_horizon_diff" data-reference-type="ref" data-reference="tab:paired_horizon_diff">5</a> reports the portfolio analogue of a Diebold & Mariano (1995)–West (1996) differential predictive ability test, computing the meeting-paired difference $`\Delta_t(h) = r^{\text{LLM}}_t(h) - r^{\text{ED4}}_t(h)`$ at five horizons on the common high-conviction sample (meetings where both signals clear their own expanding-window top-tercile threshold), with horizon-dependent stationary-block bootstrap inference and Holm-adjusted $`p`$-values across horizons. On point estimates the LLM–ED4 gap is positive and economically meaningful at short horizons ($`+7.9`$ bp at $`h{=}3`$, $`+8.6`$ bp at $`h{=}6`$), but the raw paired $`p`$-values at those horizons are weak ($`p = 0.191`$ and $`0.236`$ respectively), so multiple-testing correction is not what is keeping them from significance. The only horizon that clears unadjusted $`5\%`$ is $`h{=}24`$ ($`p = 0.045`$), which then fails Holm adjustment across the five horizons. The honest reading is that the visible early-horizon gap in Figure <a href="#fig:horizon_sweep_cum" data-reference-type="ref" data-reference="fig:horizon_sweep_cum">12</a> is suggestive but *not formally significant* on the paired sample; point estimates are consistent with a constant-sign LLM advantage on the high-conviction sample, but that sample is too small ($`N_h \in [21, 25]`$) for the observed effect sizes ($`\approx 8`$ bp) to clear conventional thresholds.

<div id="tab:paired_horizon_diff">

| $`h`$ (months) | $`L_h`$ | $`N_h`$ | LLM (bp) | ED4 (bp) | $`\bar\Delta`$ (bp) | SE (bp) | $`p`$ | Holm $`p`$ |
|:--:|:--:|:--:|:---|:---|:---|:---|:---|:---|
| 3 | 2 | 25 | +17.52 | +9.66 | +7.85 |  | 0.191 |  |
| 6 | 4 | 25 | +19.28 | +10.68 | +8.59 |  | 0.236 |  |
| 12 | 8 | 24 | +34.45 | +37.28 | $`-2.83`$ |  |  |  |
| 18 | 12 | 24 | +30.57 | +29.14 | +1.44 |  |  |  |
| 24 | 16 | 21 | +25.83 | +20.40 | +5.43 |  | 0.045$`^{**}`$ |  |

Paired LLM–ED4 Difference at Multiple Holding Horizons

</div>

<div class="minipage">

*Note:* For each horizon $`h`$, the paired statistic $`\Delta_t(h) = r^{\text{LLM}}_t(h) - r^{\text{ED4}}_t(h)`$ is computed on the common high-conviction sample where both signals clear their own expanding-window top-tercile threshold; both returns use the same 1m/2y equal-notional payoff and differ only in the sign. Inference: stationary block bootstrap ((Politis & Romano, 1994); $`5{,}000`$ resamples) on $`\{\Delta_t(h)\}`$ with mean block length $`L_h = \lceil \text{median holding length in days} / 45\rceil`$ to absorb calendar overlap of consecutive holds. Holm $`p`$ is the Holm (1979) step-down adjustment across the five pre-specified horizons. The test is the portfolio analogue of Diebold & Mariano (1995)–West (1996) differential predictive ability. $`^{*}p<0.10`$, $`^{**}p<0.05`$, $`^{***}p<0.01`$.

</div>
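
The Holm adjustment referenced in the note is the usual step-down: sort the raw $`p`$-values, multiply the $`k`$-th smallest by $`m - k + 1`$, and enforce monotonicity. A sketch using the three raw $`p`$-values reported in the text, plus two illustrative placeholders for the untabulated $`h = 12`$ and $`h = 18`$ entries:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm (1979) step-down adjusted p-values (monotone, capped at 1)."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p[i])
        adj[i] = min(1.0, running_max)
    return adj

# Raw paired p-values at h = 3, 6, 12, 18, 24. The text reports 0.191, 0.236,
# and 0.045; the h = 12 and h = 18 values are illustrative placeholders.
raw = [0.191, 0.236, 0.60, 0.55, 0.045]
adj = holm_adjust(raw)
print(dict(zip([3, 6, 12, 18, 24], adj.round(3))))

# The smallest raw p (h = 24, 0.045) is multiplied by 5 in the first step,
# so it cannot survive Holm at the 5% level: 5 * 0.045 = 0.225.
assert adj[-1] == 5 * 0.045
```

This makes the text's verdict mechanical: with five pre-specified horizons, any raw $`p`$ above $`0.01`$ fails Holm at the $`5\%`$ level regardless of the other four entries.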

### Regime stability and the 2022–2024 cycle

The cumulative-response path in Figure <a href="#fig:trading_pnl" data-reference-type="ref" data-reference="fig:trading_pnl">13</a> shows where the directional content concentrates over 1996–2025 (the 2008–2015 ZLB period contributes no events by construction) and serves as a regime-stability sanity check on the validation result. The figure uses a $`300`$-day window rather than the baseline $`180`$-day specification so that the cycle-by-cycle accumulation is visually legible across the full sample; per-meeting significance is reported on the $`180`$-day specification throughout. On the $`300`$-day window, the $`2022`$–$`2024`$ cycle accounts for the largest single share at $`57\%`$ of cumulative response ($`+4.3`$ pp out of $`+7.6`$ pp); the pre-crisis tightening (1996–2007) contributes $`24\%`$; the COVID window (2020–2021), $`18\%`$; and the post-2015 normalisation is roughly flat at $`-7\%`$. The directional content is therefore not specific to a single regime, but it is concentrated in periods of active rate-cycle action and is materially attenuated when the $`2022`$–$`2024`$ cycle is excluded. Three cycles contribute positively, and only the quiet post-2015 normalisation is a net drag. Regime stability of the *validation result* is, in any case, consistent with—but does not prove—stability of the underlying *extraction*: a single signal can produce stable directional content across regimes either because the extraction is invariant or because rate-cycle dynamics are persistent enough that any plausible signed signal would do similar work.

### Sign-disagreement detail: LLM vs ED4 directional calls

The cleanest comparison between the LLM signal and the announcement-window path basis strips both signals down to their directional calls and isolates the meetings where they disagree: when the two signals point opposite ways, the underlying payoff kernels are mirror images by construction, so the question reduces to which sign matches the realised slope move. If the LLM were a noisy reading of what ED4 already prices, the two should agree at most meetings and the LLM contribution should vanish on the disagreement subsample. Table <a href="#tab:sign_disagreement" data-reference-type="ref" data-reference="tab:sign_disagreement">[tab:sign_disagreement]</a> partitions the 172-meeting common LLM–ED4 sample by directional agreement and reports the per-meeting unsigned 1m/2y payoff signed by each measure. Inference uses the stationary block bootstrap of Politis & Romano (1994) with $`L = 4`$ events and $`5{,}000`$ resamples to absorb the calendar overlap of consecutive 180-day holds; Subsection <a href="#subsec:overlap_inference" data-reference-type="ref" data-reference="subsec:overlap_inference">[subsec:overlap_inference]</a> compares this correction to naive iid, HAC, and effective-sample inference.

The two signals disagree on direction at $`44.2\%`$ of meetings ($`N = 76`$ of $`172`$), far more than a negligible tail of the sample. On the 96-meeting agreement subsample the positions are identical and earn $`+14.4`$ bp per meeting, significant at the $`0.1\%`$ level: this is the joint-information benchmark. The 76 disagreement meetings are where the information sets diverge; on these meetings the payoff kernels are mirror images by construction, so the test reduces to a paired directional-accuracy comparison in which ED4’s loss is mechanically the LLM’s gain. The LLM is concordant with the realised slope move on $`44`$ of $`76`$ meetings ($`57.9\%`$) and earns $`+8.5`$ bp per meeting at the $`5\%`$ level under the magnitude-weighted block bootstrap (Table <a href="#tab:breakdowns" data-reference-type="ref" data-reference="tab:breakdowns">1</a>, Panel C). This is the headline economic test for the disagreement subsample: it weights each meeting by its realised payoff, and is the test of interest because trading P&L is dominated by larger-magnitude moves. As a descriptive complement, McNemar’s unweighted paired-direction diagnostic on the discordant cells $`(b, c) = (44, 32)`$ gives $`\chi^2_{cc} = 1.59`$, not statistically significant, with the same verdict from the exact two-sided binomial; this criterion treats a $`0.2`$-bp and a $`15`$-bp move identically, so its more cautious verdict simply reflects that the LLM advantage is concentrated on larger-magnitude moves rather than spread uniformly across the disagreement subsample. Aggregating over the full common sample, the LLM averages $`+11.8`$ bp at the $`1\%`$ level while ED4 averages $`+4.3`$ bp and is not significant at the $`5\%`$ level. ED4 underperforms because its sign is discordant with the realised slope move on the disagreement subsample, and the agreement gains do not compensate.
The formal test for incremental information content—a horse race regressing the unsigned flattener payoff on both signed signals jointly—confirms the asymmetry: the LLM coefficient is significant at the $`1\%`$ level while the ED4 coefficient collapses to zero (Subsection <a href="#subsec:trading_horse_race" data-reference-type="ref" data-reference="subsec:trading_horse_race">[subsec:trading_horse_race]</a>). The LLM is therefore not a noisy proxy for the announcement-window basis: it carries magnitude-weighted directional content beyond what ED4 captures, concentrated precisely where the two information sets diverge. On the unweighted directional criterion the LLM advantage is positive in point estimate ($`44/76 = 57.9\%`$ vs $`32/76 = 42.1\%`$) but does not clear conventional significance; the economic advantage is therefore best read as magnitude-weighted, with the unweighted McNemar serving as a descriptive sanity check rather than a formal directional-accuracy claim.
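
The McNemar arithmetic on the discordant cells is easy to verify; the continuity-corrected statistic and the exact binomial below reproduce the reported $`\chi^2_{cc} = 1.59`$ and its non-significance:

```python
import math

b, c = 44, 32   # discordant meetings: LLM correct vs ED4 correct, of 76

# Continuity-corrected McNemar statistic, chi-squared with 1 df.
chi2_cc = (abs(b - c) - 1) ** 2 / (b + c)
print(round(chi2_cc, 2))            # 1.59, matching the text
assert round(chi2_cc, 2) == 1.59

# Exact two-sided binomial test: 44 successes in 76 fair-coin trials.
p_upper = sum(math.comb(76, k) for k in range(44, 77)) / 2 ** 76
p_exact = min(1.0, 2 * p_upper)
print(round(p_exact, 3))
assert p_exact > 0.05               # same verdict: not significant
```

The gap between this unweighted verdict and the significant magnitude-weighted bootstrap is exactly the point made in the text: the statistic counts directions, not payoff sizes.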

The directional advantage transmits to the holding-period return primarily around subsequent FOMC announcements rather than through inter-meeting drift. Decomposing the holding-period return into the part accruing within $`\pm 2`$ trading days of subsequent FOMC announcements and the part accruing on other days (Subsection <a href="#subsec:trading_calendar_decomp" data-reference-type="ref" data-reference="subsec:trading_calendar_decomp">[subsec:trading_calendar_decomp]</a>), the LLM-based portfolio loads $`35.4\%`$ of cumulative return on $`10.9\%`$ of calendar days (concentration ratio $`3.21\times`$), while the ED4-based portfolio’s announcement-window component slightly exceeds the total ($`9.88\times`$, with negative inter-meeting drift). A common-rate-channel regression of the holding-period return on the cumulative rate change implied by the FOMC announcements inside the hold yields essentially identical fits ($`R^2 = 0.155`$ for the LLM and $`0.154`$ for ED4; specification and full regression in Subsection <a href="#subsec:trading_horse_race" data-reference-type="ref" data-reference="subsec:trading_horse_race">[subsec:trading_horse_race]</a>), so both signals operate through the *same* rate channel and the LLM does not identify a structurally new mechanism. What the LLM contributes is cross-sectional directional content on the meetings where ED4 disagrees, which translates into a different pattern of within-hold accumulation: more inter-meeting drift for the LLM versus near-pure announcement-day pricing for ED4.

### Placebo timing tests

Four placebos confirm that the abnormal return is post-announcement, event-driven, and meeting-specific. Table <a href="#tab:placebo_timing" data-reference-type="ref" data-reference="tab:placebo_timing">[tab:placebo_timing]</a> reports the baseline result alongside the four placebo specifications. Inference uses the same stationary block bootstrap as Table <a href="#tab:factor_decomposition" data-reference-type="ref" data-reference="tab:factor_decomposition">[tab:factor_decomposition]</a> ($`5{,}000`$ resamples, mean block length $`L = 4`$ events), so the placebo $`p`$-values are directly comparable to the baseline.
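A minimal sketch of the inference machinery, a stationary block bootstrap of the mean with geometric block lengths in the spirit of Politis & Romano (1994); this is an illustration under simplified assumptions, not the paper's exact implementation:

```python
import numpy as np

def stationary_bootstrap_pvalue(x, n_boot=5000, mean_block=4, seed=0):
    """Two-sided p-value for H0: E[x] = 0 under the stationary bootstrap.
    Blocks restart with probability 1/mean_block and wrap circularly, so the
    expected block length is `mean_block` events."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n, p_restart = len(x), 1.0 / mean_block
    obs = x.mean()
    centered = x - obs                      # impose the null by centring
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.empty(n, dtype=int)
        idx[0] = rng.integers(n)
        for t in range(1, n):
            if rng.random() < p_restart:    # start a new block
                idx[t] = rng.integers(n)
            else:                           # continue the current block
                idx[t] = (idx[t - 1] + 1) % n
        boot_means[b] = centered[idx].mean()
    return float((np.abs(boot_means) >= abs(obs)).mean())
```

Centring before resampling imposes the null, so the p-value is the bootstrap probability of a mean at least as large in absolute value as the observed one.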

*(A) Look-ahead timing placebo.* Because $`\hat{s}_t`$ is realised only at the announcement, applying it to the pre-meeting window $`[t-H, t-1]`$ is anti-causal by construction; a positive result would suggest the abnormal return correlates with pre-meeting drift rather than post-announcement repricing. The placebo earns $`-10.3`$ bp per trade (Sharpe $`-0.44`$) at the $`1\%`$ level, the mirror image of the baseline. The pre-announcement window does not contain the abnormal return and, if anything, loads in the opposite direction—consistent with the meeting representing a directional reversal of pre-meeting drift rather than a continuation of it.

*(B) Offset entry.* Delaying entry by $`30`$, $`60`$, $`90`$ and $`120`$ trading days post-announcement, holding the $`H = 180`$-day window fixed from the delayed entry (so entry $`t+k`$ to exit $`t+k+H`$), gives Sharpe ratios of $`+0.38, +0.42, +0.24, +0.29`$. Per-trade returns peak at the $`t+30`$–$`t+60`$ entry window (which contains the first subsequent FOMC meeting) and decay thereafter, consistent with reaction-function learning at subsequent meetings rather than single-announcement mispricing. The $`t+30`$ and $`t+60`$ rows outperform the $`t+1`$ baseline; this is a persistence diagnostic, not an alternative chosen ex post to maximise the Sharpe ratio. The baseline specification uses $`t+1`$ entry as the earliest implementable benchmark.

*(C) Pseudo-event placebo.* Randomising the entry dates while preserving the directional calls reduces the mean per-trade return from the real $`+11.4`$ bp to roughly zero on average across $`1{,}000`$ random-date simulations. The formal pseudo $`p`$-value is $`0.25`$ because the simulated distribution of pseudo means is wide; the point estimate is, however, essentially zero, in line with the FOMC dates being where the abnormal return is concentrated.

*(D) Lagged-signal placebo.* Replacing $`\hat{s}_t`$ with the previous meeting’s signal $`\hat{s}_{t-1}`$ at meeting $`t`$ (with the canonical $`t+1`$ entry and the same threshold filter on $`|\hat{s}_t|`$) tests whether the abnormal return is genuinely meeting-specific or simply reflects sign autocorrelation across consecutive meetings. The placebo earns $`+6.2`$ bp per trade with a Sharpe of $`0.21`$, statistically indistinguishable from zero. The lagged signal carries some directional content because rate-cycle persistence makes consecutive meetings correlated, but the magnitude collapses by nearly half and significance disappears, confirming that most of the abnormal return is meeting-$`t`$-specific information rather than slow-moving regime persistence.

### Threshold sensitivity

Returns are not sensitive to the choice of the top-tercile signal threshold. Table <a href="#tab:threshold_sensitivity" data-reference-type="ref" data-reference="tab:threshold_sensitivity">[tab:threshold_sensitivity]</a> reports the baseline result alongside alternative percentile thresholds from $`0.50`$ to $`0.90`$.

Across thresholds, the per-trade return ranges from $`+8.5`$ to $`+11.4`$ bp and the Sharpe ratio from $`+0.27`$ to $`+0.38`$. Under the same stationary block bootstrap used elsewhere in this appendix, three of the six thresholds ($`50`$th, $`60`$th, $`67`$th percentiles) clear the $`5\%`$ level and two more ($`75`$th, $`80`$th) clear the $`10\%`$ level; only the $`90`$th-percentile threshold loses significance, because the event sample shrinks to $`N = 26`$. The baseline tercile choice is in the interior of the robust range, and there is no evidence of ex-post selection.
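The sweep itself is mechanical. A hedged sketch, assuming arrays of signed per-meeting surprises and realised flattener payoffs (names illustrative):

```python
import numpy as np

def threshold_sweep(signal, payoff, pcts=(0.50, 0.60, 0.67, 0.75, 0.80, 0.90)):
    """Mean per-trade return when trading only meetings whose |signal| clears
    each percentile of the |signal| distribution; each trade is signed by the
    direction of the signal."""
    signal = np.asarray(signal, dtype=float)
    payoff = np.asarray(payoff, dtype=float)
    per_trade = np.sign(signal) * payoff    # flattener P&L signed by the call
    out = {}
    for q in pcts:
        active = np.abs(signal) >= np.quantile(np.abs(signal), q)
        out[q] = float(per_trade[active].mean())
    return out
```

Higher thresholds mechanically shrink the event sample, which is why significance, not the point estimate, deteriorates at the 90th percentile.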

### Duration-neutral weighting

Table <a href="#tab:dv01_neutral" data-reference-type="ref" data-reference="tab:dv01_neutral">[tab:dv01_neutral]</a> repeats the baseline strategy under duration-neutral weighting, which sets $`w_{1m} \approx 0.96`$ and $`w_{2y} \approx 0.04`$ so that a parallel yield-curve shift produces zero return. Per-trade returns rise from $`+11.4`$ bp to $`+34.0`$ bp and the Sharpe ratio from $`0.38`$ to $`0.45`$, significant at the $`5\%`$ level. The improvement reflects the duration asymmetry of the 1m/2y pair: the equal-notional position places half its notional in the 2-year leg, which partially offsets the front-end move; removing that offset isolates the slope signal. The result therefore survives duration neutralisation, which rules out exposure to parallel shifts as the source of the abnormal return.
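The weights follow from equalising duration times notional across the two legs. A minimal sketch with stylised durations (roughly $`1/12`$ year for the 1m leg and $`2`$ years for the 2y leg; the paper's exact duration inputs may differ):

```python
def duration_neutral_weights(dur_short: float, dur_long: float):
    """Notional weights summing to one that equalise duration x notional
    across the two legs, so a parallel yield shift nets to zero on the
    long-short pair."""
    w_short = dur_long / (dur_short + dur_long)
    return w_short, 1.0 - w_short

# Stylised durations reproduce the weights quoted in the text.
w_1m, w_2y = duration_neutral_weights(1.0 / 12.0, 2.0)  # = (0.96, 0.04)
```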

### Cross-strategy robustness

The baseline portfolio is a 1m/2y equal-notional flattener. Table <a href="#tab:cross_strategy_robustness" data-reference-type="ref" data-reference="tab:cross_strategy_robustness">[tab:cross_strategy_robustness]</a> reports the same sweep across alternative front-end and long-end yield-curve pairs at equal-notional weighting, holding all other strategy parameters at their canonical values; the duration-neutral alternative for the baseline pair is in Table <a href="#tab:dv01_neutral" data-reference-type="ref" data-reference="tab:dv01_neutral">[tab:dv01_neutral]</a>.

Two patterns emerge. First, every front-end pair (1m/2y, 1m/5y, 1m/10y) generates a positive Sharpe ratio significant at the $`5\%`$ level. The baseline 1m/2y equal-notional flattener has a Sharpe of $`0.38`$, significant at the 5% level, sitting at the lower end of the front-end distribution; the 1m/10y variant earns more per trade ($`+16.2`$ bp vs $`+11.4`$ bp) at a comparable Sharpe of $`0.41`$. The baseline pair is retained because the local-projection evidence in Subsection <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a> shows the 10-year leg is essentially flat post-announcement, so a 1m/10y position would capture mostly the front-end response and, under equal-notional weighting, would amount to exposure to parallel shifts in rates rather than a slope-specific position. Second, the long-end-only pair (2y/10y) is meaningfully weaker, with a Sharpe of $`0.26`$ that is only marginally significant at the 10% level: the abnormal return is concentrated in the front of the curve where the LLM signal loads most strongly. Duration-neutral weighting for the baseline pair—reported separately in Table <a href="#tab:dv01_neutral" data-reference-type="ref" data-reference="tab:dv01_neutral">[tab:dv01_neutral]</a>—raises per-trade returns from $`+11.4`$ to $`+34.0`$ bp; equal-notional weighting is retained for the baseline because, as discussed in the footnote to Subsection <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a>, it preserves the carry component of the position and renders the strategy interpretable as a directional slope position rather than a pure relative-value one.

### Strategy $`\times`$ signal panel: is the LLM–ED4 gap homogeneous?

The baseline horse race in Subsection <a href="#subsec:persistence" data-reference-type="ref" data-reference="subsec:persistence">[subsec:persistence]</a> is conducted on one strategy—the 1m/2y equal-notional flattener—and one alternative signal—ED4. A natural concern is whether the LLM advantage is specific to that maturity pair or systematic across the front of the curve, and whether ED4 is the right benchmark or just one of several announcement-window comparators that would deliver the same verdict. This subsection repeats the same portfolio construction across a panel of yield-curve flatteners signed by each of five candidate signals (LLM, FF1, FF4, ED1, ED4), holding all other strategy parameters at their baseline values. The exercise serves two purposes. First, it addresses the ex-post strategy-selection concern by showing how the LLM advantage scales with maturity. Second, it pools cross-sectional information into a panel test that the single-cell horse race cannot deliver.

Table <a href="#tab:strategy_signal_panel" data-reference-type="ref" data-reference="tab:strategy_signal_panel">[tab:strategy_signal_panel]</a> and the corresponding heatmap in Figure <a href="#fig:strategy_signal_heatmap" data-reference-type="ref" data-reference="fig:strategy_signal_heatmap">49</a> report per-trade returns for each (strategy, signal) cell on each signal’s own top-tercile sample. The pattern is monotone in the front-end loading of the strategy. On the five front-end pairs (1m/2y, 1m/5y, 1m/10y, 6m/5y, 6m/10y), the LLM column is the largest in all five rows, with per-trade returns ranging from $`+10.5`$ to $`+15.1`$ bp at the $`5\%`$ level or better. ED1, ED4, and FF4 sit second to fourth across these rows, FF1 typically last; the LLM–ED4 gap in per-cell means is $`+1.4`$ to $`+5.4`$ bp. On the two long-end-only pairs the ranking inverts in both rows. The 2y/10y pair has the LLM at $`+4.6`$ bp, marginally significant at the $`10\%`$ level, versus ED1–ED4 at $`+8.0`$ to $`+8.5`$ bp at the $`1\%`$ level; the 5y/10y pair has the LLM at $`+0.8`$ bp, statistically indistinguishable from zero, versus the announcement-window signals at $`+2.2`$ to $`+3.7`$ bp, all significant at the $`5\%`$ level or better. Both long-end pairs show the LLM contribution collapsing or reversing, while every front-end pair has the LLM as the dominant signal. The gap is therefore not uniform across strategies but loads precisely on the maturities where the local-projection evidence in Figure <a href="#fig:yield_lp" data-reference-type="ref" data-reference="fig:yield_lp">11</a> predicts the LLM should add information — the front of the curve where forward-guidance content propagates — and disappears in the segment of the curve the LP shows is unmoved by the narrative surprise.

<figure id="fig:strategy_signal_heatmap" data-latex-placement="t!">
<embed src="figures/trading/robustness/v30.5/deepseek-v3.1:671b-cloud/strategy_signal_heatmap_light.pdf" style="width:85.0%" />
<figcaption><em>Note:</em> Per-trade return (bp) for each (strategy, signal) cell on the equal-notional top-tercile <span class="math inline">180</span>-day flattener. The dashed border highlights the LLM column. Stars denote the stationary block bootstrap <span class="math inline"><em>p</em></span>-value of the per-cell mean against zero (<span class="math inline"><sup>*</sup><em>p</em> &lt; 0.10</span>, <span class="math inline"><sup>**</sup><em>p</em> &lt; 0.05</span>, <span class="math inline"><sup>* * *</sup><em>p</em> &lt; 0.01</span>); colour intensity is the per-trade return on a diverging scale centred at zero. The LLM column is the largest entry in every front-end row (1m and 6m short legs); the ranking inverts on the long-end-only 2y/10y row, where the LP in Figure <a href="#fig:yield_lp" data-reference-type="ref" data-reference="fig:yield_lp">11</a> predicts no LLM content.</figcaption>
</figure>

The per-cell comparison conflates two effects: the two signals are active at different meetings (each filters on its own $`|s_t|`$ distribution), and they sign meetings differently. Table <a href="#tab:strategy_signal_paired" data-reference-type="ref" data-reference="tab:strategy_signal_paired">[tab:strategy_signal_paired]</a> isolates the second effect by restricting to the common subsample of meetings where *both* signals clear their own top-tercile threshold ($`N = 26`$ meetings) and reports the paired difference $`\Delta_t(s) = r_{\text{LLM}}(s, t) - r_{\text{ED4}}(s, t)`$ per strategy. The unsigned payoff is identical within a strategy across the two signal columns by construction; the difference reduces to whether the LLM sign matches the realised slope move more often than the ED4 sign on this restricted high-conviction sample.

The five front-end strategies all show $`\bar\Delta(s) \in [+10.4, +15.2]`$ bp; three of the five clear the $`10\%`$ threshold under the per-strategy block bootstrap and the remaining two sit just above it. The two long-end placebos collapse: the 2y/10y pair to $`+3.0`$ bp, not statistically significant, and the 5y/10y pair to $`-0.6`$ bp, where the LLM marginally underperforms ED4. Pooling all $`(s, t)`$ cells and resampling *meetings* as blocks—the appropriate inference unit, since strategies share the underlying yield curve and are mechanically correlated within-meeting—gives a pooled mean $`\bar\Delta = +9.3`$ bp that does not clear the $`10\%`$ threshold. The pooled significance is conservative because the meeting-clustered bootstrap respects the within-meeting correlation that pure cell-level pooling would ignore, and because both long-end placebos are included in the pool and dilute the signal. The cross-section therefore corroborates the single-cell horse race in three ways: (i) the LLM advantage is positive on every front-end strategy, not only the baseline pair; (ii) the magnitude of the advantage is roughly constant across front-end pairs at $`+10`$ to $`+15`$ bp per trade rather than concentrated in one cell; and (iii) the advantage vanishes on both long-end-only pairs—exactly the segment of the curve where the LP in Figure <a href="#fig:yield_lp" data-reference-type="ref" data-reference="fig:yield_lp">11</a> shows the narrative surprise is unmoved. The exercise addresses the ex-post strategy-selection concern without rescuing the pooled significance, which remains modest at the meeting-clustered level.
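Because the unsigned payoff is common within a strategy, the paired difference depends only on the two directional calls. A sketch (function name illustrative):

```python
import numpy as np

def paired_signal_difference(k_bp, d_llm, d_ed4):
    """Per-meeting paired return difference D_t = (d_LLM - d_ED4) * k_t between
    the LLM- and ED4-signed versions of the same flattener. D_t is zero on
    agreement meetings and +/- 2*k_t where the signs diverge, so the mean
    difference is driven entirely by disagreement meetings."""
    k = np.asarray(k_bp, dtype=float)
    delta = (np.asarray(d_llm, dtype=float) - np.asarray(d_ed4, dtype=float)) * k
    return delta, float(delta.mean())
```

Resampling meetings (not cells) as bootstrap blocks then respects the within-meeting correlation across strategies described above.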

### Cross-model robustness

The baseline strategy is computed on the v30.5 deepseek-v3.1 sample with $`N = 70`$ events. To check that the result is not an artifact of one extraction model, Table <a href="#tab:cross_model_robustness" data-reference-type="ref" data-reference="tab:cross_model_robustness">6</a> repeats the canonical 1m/2y flattener (top-tercile signal, $`180`$-day hold) using surprises produced by six different LLMs at a common prompt version v30.2, where all models have been run on the full 1996–2024 sample.[^32]

<div id="tab:cross_model_robustness">

| Model | $`N_s`$ | $`N_t`$ | Eq. bp | Dur.-neut. bp | Ann. % | Sharpe | Hit % | $`p`$ |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| DeepSeek-V3.1 (671B) |  |  | +12.35 | +22.41 | +0.251 | +0.38 |  | $`^{**}`$ |
| Gemma-4 (31B) |  |  | +4.85 |  | +0.098 | +0.15 |  |  |
| Kimi-K2.6 |  |  | +6.23 | +9.70 | +0.126 | +0.24 |  | $`^{*}`$ |
| Qwen-3.6 (35B-A3B) |  |  | +13.02 | +30.09 | +0.264 | +0.41 |  | $`^{**}`$ |
| GPT-4.1-mini |  |  | +11.20 | +36.37 | +0.227 | +0.35 |  | $`^{**}`$ |
| GPT-5-mini |  |  | +12.94 | +47.24 | +0.263 | +0.47 |  | $`^{**}`$ |

Cross-Model Robustness: Front-End Flattener Across Extraction LLMs

</div>

<div class="minipage">

*Note:* Headline 1m/2y flattener strategy (top-tercile signal, 180-day hold) applied to surprises extracted by each LLM at a common prompt version v30.2. $`N_s`$: number of valid surprises in the model’s master DB. $`N_t`$: number of trades (top-tercile filter and 180-day exit availability). *Eq. bp*: equal-notional per-trade return in basis points. *Dur.-neut. bp*: duration-neutral per-trade return in basis points. Sharpe and $`p`$-value refer to the equal-notional specification. Hit (%) is the share of trades with positive return. Two-sided $`p`$-values from a stationary block bootstrap (Politis & Romano, 1994; 5,000 resamples, $`L=4`$ trades) to account for calendar overlap of consecutive 180-day holds. $`^{*}p<0.10`$, $`^{**}p<0.05`$, $`^{***}p<0.01`$.

</div>

Four of the six models deliver positive per-trade returns significant at the $`5\%`$ level: DeepSeek-V3.1 ($`+12.4`$ bp, Sharpe $`0.38`$), GPT-5-mini ($`+12.9`$ bp, Sharpe $`0.47`$), Qwen-3.6 ($`+13.0`$ bp, Sharpe $`0.41`$), and GPT-4.1-mini ($`+11.2`$ bp, Sharpe $`0.35`$); a fifth (Kimi-K2.6, $`+6.2`$ bp, Sharpe $`0.24`$) is significant at the $`10\%`$ level. Hit rates cluster at $`62`$–$`65\%`$ across the four frontier-class models. Only Gemma-4-31B is statistically insignificant, at $`+4.9`$ bp with a Sharpe of $`0.15`$. The duration-neutral column makes the same point more emphatically: GPT-5-mini reaches $`+47`$ bp per trade, GPT-4.1-mini $`+36`$, Qwen-3.6 $`+30`$, DeepSeek $`+22`$. The result is not specific to one LLM; it is consistent across frontier-class extractors. Extraction quality scales with model capability, consistent with the broader thesis that shock quality is upper-bounded by expectation quality and improves as the documentary expectation engine improves.

### Cross-model robustness of the dovish-side conditional asymmetry

The dovish-side prose in Subsection <a href="#subsec:trading" data-reference-type="ref" data-reference="subsec:trading">6.2</a> attributes part of the dovish-leg outperformance to a textual feature of pre-meeting Fed communication: documents that precede dovish surprises tend to use more state-contingent (“conditional”) language than documents that precede hawkish surprises. Because that claim depends on a tercile assignment of meetings into dovish, middle, and hawkish, and because the underlying surprises are extracted by a single LLM, the natural concern is that the asymmetry is an artifact of one extractor. Figure <a href="#fig:text_asymmetry_forest" data-reference-type="ref" data-reference="fig:text_asymmetry_forest">50</a> repeats the test across $`23`$ independent extraction runs spanning six LLM families (DeepSeek-V3.1, GPT-4.1-mini, GPT-5-mini, Gemma-31B, Qwen-35B, Kimi-K2.6), each evaluated under one or more pipeline-version reruns of the same v26.0 prompt suite. The exercise restricts attention to post-2008 meetings and computes terciles *within* each Fed communication regime (pre-FG, ZLB/FG, post-liftoff, COVID/ZLB, $`2022`$+) so that “hawkish” means hawkish for that regime.[^33]

<figure id="fig:text_asymmetry_forest" data-latex-placement="H">
<embed src="figures/text_asymmetry/v30.5/deepseek-v3.1:671b-cloud/multi_model_forest_light.pdf" />
<figcaption>Dovish-minus-hawkish gap in pre-meeting conditional-language frequency, by extractor and prompt version. Post-2008 sample, within-regime terciles, statement + minutes pooled. Bars: bootstrap <span class="math inline">95%</span> CIs over meetings. Vertical dashed line at zero. Positive values support the prose claim that pre-meeting documents are heavier in conditional language ahead of dovish surprises.</figcaption>
</figure>

The prose direction (dovish $`>`$ hawkish) appears in $`20`$ of $`23`$ runs and is significant at the $`5\%`$ level in $`13`$; the opposite direction never reaches $`5\%`$ significance ($`0`$ of $`23`$). The asymmetry shows up most cleanly on the GPT-4.1-mini (significant in $`3`$ of $`3`$), GPT-5-mini ($`3`$ of $`3`$), and Gemma-31B ($`7`$ of $`7`$) families, where the dovish-minus-hawkish gap ranges from $`+1.1`$ to $`+2.0`$ phrases per $`1{,}000`$ words. On Qwen-35B and Kimi-K2.6 the gap is positive but does not reach significance. On the headline DeepSeek-V3.1 model the gap is small and oscillates around zero ($`3`$ of $`6`$ prompt versions positive, none significant), with a CI that always includes zero; the cross-extractor evidence therefore corroborates the directional claim without identifying it from the headline model alone, and the multi-model picture rules out a sign-flipping artifact of any single extractor.

Table <a href="#tab:text_diagnostics" data-reference-type="ref" data-reference="tab:text_diagnostics">[tab:text_diagnostics]</a> reports two complementary full-sample diagnostics on the same conditional-language hypothesis. Panel A shows that LLM signal magnitude correlates positively with the density of conditional language in the pre-meeting Beige Book (Spearman $`\rho = +0.245`$): meetings in the high-density half earn $`+16.2`$ bp on average versus $`-4.4`$ bp on the low-density half, a $`+20.6`$ bp gap significant at the $`1\%`$ level. Panel B benchmarks the LLM signal against a dictionary baseline: substituting the Apel & Blix Grimaldi (2014) hawk/dove lexicon into the same 1m/2y flattener strategy yields a $`+15.4`$ bp per-trade return with Sharpe $`0.85`$, but the dictionary signal is essentially uncorrelated with the LLM surprise ($`\rho = +0.136`$). The dictionary baseline is therefore a complementary, not redundant, source of directional content; the LLM signal and the dictionary signal each capture distinct text features that happen to load similarly on the front-end response.
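A per-1,000-words conditional-language density of the kind used in Panel A can be sketched as follows; the marker list here is illustrative and is neither the Apel & Blix Grimaldi lexicon nor the paper's exact phrase set:

```python
import re

MARKERS = ("would", "could", "until", "unless", "if", "were",
           "expected to", "remained unclear")  # illustrative, not the paper's list

def conditional_density(text: str, markers=MARKERS) -> float:
    """Occurrences of conditional/state-contingent markers per 1,000 words."""
    words = re.findall(r"[a-z']+", text.lower())
    joined = " ".join(words)  # normalises whitespace so multi-word phrases match
    hits = sum(len(re.findall(r"\b" + re.escape(m) + r"\b", joined))
               for m in markers)
    return 1000.0 * hits / max(len(words), 1)
```

Splitting meetings at the median of this density would then reproduce the high/low-density comparison in Panel A, up to the choice of lexicon.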

### Horse race on the flattener kernel and the common rate channel

The sign-disagreement evidence in Subsection <a href="#subsec:persistence" data-reference-type="ref" data-reference="subsec:persistence">[subsec:persistence]</a> shows that the LLM picks the realised front-end-vs-belly slope direction on the meetings where the announcement-window basis diverges. This subsection backs that result with two regressions on the underlying flattener payoff: (i) a horse race between the LLM and ED4 directional calls, and (ii) a common rate-channel decomposition that explains why the two signals deliver similar long-run cumulative returns despite differing in the cross-section.

*(A) Joint regression on the unsigned payoff.* Express the unsigned 1m/2y payoff in basis points by rescaling, $`k_t^{\text{bp}}(W) \equiv 50 \times (\Delta y_{1\text{m},t}(W) - \Delta y_{2\text{y},t}(W))`$, and estimate
``` math
\begin{equation}
k_t^{\text{bp}}(W) = \alpha_W + \beta_W^{\text{LLM}} d_t^{\text{LLM}} + \beta_W^{\text{ED4}} d_t^{\text{ED4}} + \varepsilon_t(W),
\qquad d_t^m \equiv \mathrm{sign}(s_t^m) \in \{-1, +1\},
\label{eq:horse_race}
\end{equation}
```
on all meetings in the common sample. With the $`\pm 1`$ coding, $`2\beta_W^m`$ is the conditional hawkish-minus-dovish payoff gap implied by signal $`m`$. Table <a href="#tab:persistence_horse_race" data-reference-type="ref" data-reference="tab:persistence_horse_race">[tab:persistence_horse_race]</a> reports the result on the 217-meeting common sample at the 180-day horizon. The LLM direction alone explains $`9.4\%`$ of the payoff variance, ED4 alone $`1.0\%`$, and the joint specification $`9.9\%`$. The LLM coefficient is $`11.0`$ bp (significant at the $`1\%`$ level), implying a $`22`$ bp hawkish-minus-dovish gap; the ED4 coefficient of $`2.0`$ bp is statistically zero. The same pattern holds in the first inter-meeting segment.
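The joint specification reduces to OLS with two $`\pm 1`$ regressors; a minimal sketch:

```python
import numpy as np

def horse_race(k_bp, s_llm, s_ed4):
    """OLS of the unsigned flattener payoff (bp) on both directional calls,
    coded +/-1; with this coding 2*beta_m is the hawkish-minus-dovish payoff
    gap implied by signal m. Returns (alpha, beta_llm, beta_ed4)."""
    d_llm = np.sign(np.asarray(s_llm, dtype=float))
    d_ed4 = np.sign(np.asarray(s_ed4, dtype=float))
    X = np.column_stack([np.ones_like(d_llm), d_llm, d_ed4])
    beta, *_ = np.linalg.lstsq(X, np.asarray(k_bp, dtype=float), rcond=None)
    return beta
```

Because the two sign vectors agree at most meetings, the regressors are correlated; the joint fit is what separates incremental from redundant directional content.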

*(B) Common rate channel.* Letting $`K_t(H)`$ denote the number of FOMC announcements strictly inside the hold,
``` math
\begin{equation}
r_t(H) = \alpha_H + \gamma_H\, \pi_t \sum_{k=1}^{K_t(H)} \Delta i_{t+k} + \varepsilon_t(H)
\label{eq:rate_channel}
\end{equation}
```
yields essentially identical fits for the two signals ($`R^2 = 0.155`$ for the LLM, $`0.154`$ for ED4; coefficients $`0.0019`$ and $`0.0016`$ respectively, both significant at the $`1\%`$ level under HC3 standard errors). The two signals select similar event windows because they agree on direction at most meetings and operate through a shared rate channel; the LLM contribution is the cross-sectional directional content on the meetings where they disagree, documented in Subsection <a href="#subsec:persistence" data-reference-type="ref" data-reference="subsec:persistence">[subsec:persistence]</a>.

### Disagreement-meeting case studies

Four case-study meetings illustrate the moderator channel concretely. In each, the pre-meeting Beige Book flags a state-contingent economic condition, the LLM and ED4 surprises disagree on direction, and the LLM is concordant with the realised front-end-vs-belly slope move over the subsequent 180-day window. The quoted Beige Book passages are not the content of the residual $`\hat{s}_t`$ (that content is, by construction, in the expectation); they identify episodes in which the policy decision is most likely to resolve a contingent path, and on those meetings the announcement residual reveals which branch is taken.

#### 2008-03-18 (Bear Stearns era; deeply dovish LLM).

LLM surprise: $`-0.575`$ pp; ED4: $`+0.010`$ pp; realised 1m/2y slope move over the $`180`$-day hold: $`-45`$ bp (concordant with the LLM). At its 18 March meeting the Federal Reserve cut the federal funds target rate by 75 basis points. The 5 March Beige Book signalled a broad capex slowdown:

> *“Capital expenditures met projections during the past few months; however, half of our contacts told us that spending in 2008 **would** fall below 2007 levels.”*

The conditional “would”-clause flags a state-contingent deterioration in business investment that an unconditional reading would miss; the document-conditioned prior therefore embeds a softer trajectory, and the residual reads the 75 bp action as deeply dovish relative to that prior, while ED4 reflects only the muted announcement-window pricing of an action markets had largely anticipated.

#### 2021-11-03 (taper announcement; dovish LLM).

LLM: $`-0.237`$ pp; ED4: $`+0.007`$ pp; realised slope move: $`-82`$ bp (concordant with the LLM). At this meeting the FOMC announced the pace of asset-purchase tapering. The 20 October Beige Book reported a softer realisation of post-pandemic activity than expected:

> *“Expectations for a resumption of business travel in September **were not realised** and several event bookings **were postponed**, which held down overall travel-related spending somewhat.”*

The disappointed-expectations clause flags a state-contingent recovery whose pace had been overstated; the prior built on that document anticipates a more patient policy posture, and the announcement residual reveals which side of the contingency the FOMC takes. ED4 records only the marginally hawkish announcement-window pricing.

#### 2022-05-04 (first 50 bp hike; hawkish LLM).

LLM: $`+0.200`$ pp; ED4: $`-0.040`$ pp; realised slope move: $`+85`$ bp (concordant with the LLM). The Federal Reserve raised the target range by 50 basis points, the first move of that size in two decades. The 20 April Beige Book reported price pressures throughout the supply chain alongside persistent labour-market frictions:

> *“While workers **were** finally returning to offices, the extent to which firms and workers embrace full in-person, full remote, or hybrid work schedules **remained unclear**.”*

The clause combines a backward-looking “were” with an uncertainty marker that flags state-dependence in labour markets; together with the supply-chain pressures, the document-conditioned prior embeds a more aggressive policy trajectory, and the residual reads the first 50 bp move and accompanying language as hawkish relative to that prior. ED4 records only the announcement-window relief that Powell ruled out 75 bp moves at that specific meeting.

#### 2023-03-22 (SVB-era hike; hawkish LLM).

LLM: $`+0.087`$ pp; ED4: $`-0.080`$ pp; realised slope move: $`+91`$ bp (concordant with the LLM). The FOMC raised the target range by 25 basis points despite the contemporaneous Silicon Valley Bank failure. The 8 March Beige Book described persistent wage pressure conditional on labour tightness:

> *“Some firms **expected to** offer above-average wage increases in 2023 to stave off still-high attrition rates, while others **were planning** for average wage growth.”*

The forward-looking “expected to” and “were planning” clauses flag state-dependent wage-setting conditional on labour-market tightness, supporting a continued-firming prior; the residual then reads the FOMC’s decision to raise despite the contemporaneous SVB stress as hawkish relative to that prior. ED4 records the announcement-window flight-to-safety pricing triggered by the banking stress instead.

The four cases illustrate the moderator channel that Table <a href="#tab:text_diagnostics" data-reference-type="ref" data-reference="tab:text_diagnostics">[tab:text_diagnostics]</a> identifies on the full sample: documents heavy in conditional language (“would”, “until”, “were”, “expected to”) flag meetings whose policy decision is most likely to resolve a state-contingent path, and the residual on those meetings reveals which branch is taken in the direction subsequent yields confirm. The Beige Book content is not itself in the residual; it identifies episodes in which the residual is most informative.

### Calendar decomposition of holding-period returns

Table <a href="#tab:persistence_concentration" data-reference-type="ref" data-reference="tab:persistence_concentration">[tab:persistence_concentration]</a> reports the calendar decomposition of holding-period returns for the LLM- and ED4-based portfolios at $`H = 180`$, decomposing the cumulative return into components accruing within $`\pm 2`$ trading days of subsequent FOMC announcements and components accruing on other days. The LLM portfolio loads $`35.4\%`$ of cumulative return on $`10.9\%`$ of calendar days (concentration ratio $`3.21\times`$); the ED4 portfolio’s announcement-window component slightly exceeds the total ($`9.88\times`$), with negative inter-meeting drift. Both signals concentrate returns on event days, but the LLM has more inter-meeting accumulation, consistent with subsequent meetings gradually resolving the state-contingent path flagged by the documents at meeting $`t`$.
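The decomposition reduces to splitting daily strategy returns on an event-window indicator; a sketch with hypothetical inputs:

```python
import numpy as np

def announcement_concentration(daily_ret, in_window):
    """Share of cumulative return accruing inside the FOMC event windows and
    the concentration ratio (return share divided by calendar-day share).
    `in_window` flags days within +/-2 trading days of a subsequent FOMC
    announcement."""
    r = np.asarray(daily_ret, dtype=float)
    m = np.asarray(in_window, dtype=bool)
    ret_share = r[m].sum() / r.sum()
    day_share = m.mean()
    return float(ret_share), float(ret_share / day_share)
```

A ratio above one means returns cluster on event days; a ratio above the total (with negative drift elsewhere) reproduces the ED4 pattern described above.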

### Factor-exposure tests: macro, fixed-income, and target/path span

Table <a href="#tab:factor_exposure" data-reference-type="ref" data-reference="tab:factor_exposure">9</a> reports three factor-exposure tests of the baseline strategy. *Panel A* regresses the per-trade yield-spread return on entry-month macro variables (12-month log-IP growth, 12-month change in unemployment, 12-month CPI inflation, FFR level, and a recession proxy with $`\Delta`$UE $`\geq 0.5`$pp): $`R^2 = 0.15`$ with no individual factor significant at the $`5\%`$ level, ruling out a time-varying-risk-premium interpretation. *Panel B* regresses the same trade return on entry-time fixed-income factors — level (1m yield), slope (10y$`-`$1m), curvature ($`2 \times`$5y $`-`$ 1m $`-`$ 10y), carry, and 90-day momentum on level and slope: $`R^2 = 0.03`$ with no significant loading; the strategy is not a known fixed-income-factor exposure. *Panel C* projects the LLM surprise directly on the Gürkaynak et al. (2005) target and path factors (rather than the four-factor (FF1, FF4, ED1, ED4) basis): the target-factor loading $`\hat{\beta}_{\text{target}} = +0.047`$ is significant at the $`1\%`$ level, while the path-factor loading $`\hat{\beta}_{\text{path}} \approx 0`$ is statistically indistinguishable from zero, with $`R^2 = 0.12`$. The LLM signal therefore correlates linearly with current-meeting target news but *not* with the linear path factor; the $`88\%`$ orthogonal residual may reflect a non-linear function of the path factor, path-relevant content uncorrelated with the linear GSS basis, finite-basis approximation error, or LLM extraction noise.

<span class="smallcaps">Panel A:</span> Per-trade return on entry-month macro variables (B1).

<div id="tab:factor_exposure">

|  | Const | IP YoY % | $`\Delta`$UE YoY (pp) | CPI YoY % | FFR level | Recession |
|:---|:---|:---|:---|:---|:---|:---|
| Coefficient | $`-13.757`$ | $`-4.423`$ | $`-17.123`$ | $`+6.794`$ | $`+2.489`$ | $`+1.464`$ |
|  | $`(11.181)`$ | $`(2.715)`$ | $`(12.317)`$ | $`(4.167)`$ | $`(3.737)`$ | $`(23.184)`$ |
| $`N`$ |  |  |  |  |  |  |
| $`R^2`$ | $`0.15`$ |  |  |  |  |

Factor-Exposure Tests for the Baseline Strategy

</div>

<span class="smallcaps">Panel B:</span> Per-trade return on entry-time fixed-income factors (B2).

<div id="tab:factor_exposure">

|  | Const | Level (1m) | Slope (10y$`-`$1m) | Curvature | Carry | Level mom. (90d) | Slope mom. (90d) |
|:---|:---|:---|:---|:---|:---|:---|:---|
| Coefficient | $`+23.353`$ | $`-2.560`$ | $`-6.198`$ | $`+5.094`$ | $`-0.620`$ | $`+2.830`$ | $`+11.141`$ |
|  | $`(15.591)`$ | $`(4.228)`$ | $`(8.559)`$ | $`(18.453)`$ | $`(0.856)`$ | $`(21.255)`$ | $`(17.265)`$ |
| $`N`$ |  |  |  |  |  |  |  |
| $`R^2`$ | $`0.03`$ |  |  |  |  |  |

Factor-Exposure Tests for the Baseline Strategy

</div>

<span class="smallcaps">Panel C:</span> LLM surprise span decomposition on GSS target/path factors (A3).

<div id="tab:factor_exposure">

|             | Const       | Target factor    | Path factor |
|:------------|:------------|:-----------------|:------------|
| Coefficient | $`-0.001`$  | $`+0.047^{***}`$ | $`-0.003`$  |
|             | $`(0.009)`$ | $`(0.018)`$      | $`(0.010)`$ |
| $`N`$       |             |                  |             |
| $`R^2`$     | $`0.12`$    |                  |             |

Factor-Exposure Tests for the Baseline Strategy

</div>

<div class="minipage">

*Note:* *Panel A* regresses per-trade yield-spread return (in basis points) on macro variables observable at position entry: 12-month log-IP growth, 12-month change in unemployment, 12-month CPI inflation, FFR level, and a recession proxy ($`\Delta`$UE $`\geq 0.5`$pp). *Panel B* regresses the same return on standard entry-time fixed-income factors: level (1m yield), slope (10y$`-`$1m), curvature ($`2 \times`$ 5y $`-`$ 1m $`-`$ 10y), carry (slope normalised by duration), and 90-day momentum on level and slope. *Panel C* projects the LLM surprise $`\hat{s}_t`$ directly on the Gürkaynak et al. (2005) target and path factors, replacing the 4-factor (FF1, FF4, ED1, ED4) basis of the span test in equation <a href="#eq:jk_span_projection" data-reference-type="eqref" data-reference="eq:jk_span_projection">[eq:jk_span_projection]</a>; a larger path-factor coefficient would support the forward-guidance interpretation of the orthogonal content. HC3 robust standard errors in parentheses. $`^{*}p<0.10`$, $`^{**}p<0.05`$, $`^{***}p<0.01`$.

</div>
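The Panel C projection can be sketched in a few lines. The snippet below simulates meeting-level data in which the surprise loads only on the target factor (sample size and the $`0.047`$ loading are borrowed from the reported estimates; all other numbers are illustrative) and re-estimates the projection with HC3 standard errors implemented directly in NumPy:

```python
import numpy as np

def ols_hc3(y, X):
    """OLS with HC3 robust standard errors (residuals scaled by (1-h_i)^2)."""
    X = np.column_stack([np.ones(len(y)), X])        # prepend constant
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # leverage values
    u = (e / (1.0 - h)) ** 2
    cov = XtX_inv @ (X.T * u) @ X @ XtX_inv          # sandwich estimator
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 217                                              # number of FOMC meetings
target = rng.normal(0, 1, n)
path = rng.normal(0, 1, n)
s_llm = 0.047 * target + rng.normal(0, 0.12, n)      # true path loading is zero
beta, se = ols_hc3(s_llm, np.column_stack([target, path]))
```

The estimated target loading recovers the simulated $`0.047`$ while the path loading is indistinguishable from zero, mirroring the pattern in Panel C.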

# Signal Extraction Model: Full Derivations

This appendix collects the load-bearing derivations behind Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a>: the calibration test ($`\beta = 1`$), the span identity that interprets the residual from the high-frequency factor regression, and the IV coefficient under leakage. A merged subsection records the omitted-signal margin in a single formula, instantiated for the public non-document and Greenbook private cases. A short closing block summarises the Bayesian setup used in the body and the comparative statics on transparency and document precision.

#### What requires Gaussianity, and what does not.

The Normal–Normal structure used in the body is invoked only to obtain closed-form recursions for posterior variance and precision accumulation. The forecast-efficiency result in Proposition <a href="#prop:efficiency" data-reference-type="ref" data-reference="prop:efficiency">1</a> does not require Gaussianity. Let $`\mathcal{I}_t \equiv \sigma(\tilde{d}_{1:4,t}, \mathcal{P}_{4,t-1})`$ denote the information set generated by the extracted document summaries and the cross-meeting prior, and let $`m_{4t} \equiv \mathbb{E}[\theta_t \mid \mathcal{I}_t]`$ denote the corresponding posterior mean. If the Expectation Engine reports a discrete posterior over a finite support of rate outcomes, $`m_{4t}`$ is the mean of that discrete posterior. In either the continuous or the discrete case,
``` math
\begin{equation*}
    \mathbb{E}[\theta_t - m_{4t} \mid \mathcal{I}_t] = 0,
\end{equation*}
```
so the posterior error is orthogonal to every $`\mathcal{I}_t`$-measurable random variable. The Gaussian framework below is therefore a convenience for variance bookkeeping; the slope-unity argument is a conditional-expectation argument that goes through for any well-defined posterior. The maintained behavioural assumption is the weaker “as-if Bayesian” claim that the reported posterior mean can be read as $`\mathbb{E}[\theta_t \mid \mathcal{I}_t]`$.

## Calibration

<div id="prop:efficiency" class="proposition">

**Proposition 1** (Forecast efficiency). *Suppose the pipeline’s reported posterior mean satisfies $`m_{4t} = \mathbb{E}[\theta_t \mid \mathcal{I}_t]`$ and that leakage is negligible ($`\ell_j \approx 0`$). Both population slopes equal unity:
``` math
\begin{equation}
\label{eq:model_beta}
    \beta_m \equiv \frac{\mathrm{Cov}(\Delta i_t, m_{4t})}{\mathrm{Var}(m_{4t})} = 1,
    \qquad
    \beta_s \equiv \frac{\mathrm{Cov}(\Delta i_t, \hat{s}_t)}{\mathrm{Var}(\hat{s}_t)} = 1.
\end{equation}
```*

</div>

*Proof.* Write $`\hat{s}_t = \Delta i_t - m_{4t} = (\theta_t - m_{4t}) + u_t`$. Because $`m_{4t}`$ is the conditional expectation of $`\theta_t`$ given $`\mathcal{I}_t`$, the residual $`\theta_t - m_{4t}`$ is orthogonal to every $`\mathcal{I}_t`$-measurable random variable; in particular $`\mathrm{Cov}(\theta_t - m_{4t},\, m_{4t}) = 0`$. Since $`u_t \perp \mathcal{G}_t`$ and $`\mathcal{I}_t \subseteq \mathcal{G}_t`$, also $`\mathrm{Cov}(u_t,\, m_{4t}) = 0`$. Hence
``` math
\begin{equation*}
    \mathrm{Cov}(\Delta i_t, m_{4t})
    = \mathrm{Cov}(m_{4t} + (\theta_t - m_{4t}) + u_t,\, m_{4t})
    = \mathrm{Var}(m_{4t}),
\end{equation*}
```
so $`\beta_m = 1`$. The surprise slope follows symmetrically: $`\mathrm{Cov}(\Delta i_t, \hat{s}_t) = \mathrm{Cov}(m_{4t} + \hat{s}_t,\, \hat{s}_t) = \mathrm{Var}(\hat{s}_t)`$, where the cross-term $`\mathrm{Cov}(m_{4t}, \hat{s}_t) = 0`$ uses the same orthogonality facts and $`\hat{s}_t = (\theta_t - m_{4t}) + u_t`$. The intercept of the surprise regression $`\Delta i_t = \alpha + \beta_s \hat{s}_t + \text{error}`$ is $`\alpha = \mathbb{E}[m_{4t}]`$ in general; centering the regressors (or imposing the additional normalisation $`\mathbb{E}[m_{4t}]=0`$) collapses $`\alpha`$ to zero, but the slope identification does not require it.

The same argument yields the exact deviation formula referenced in Section <a href="#sec:results" data-reference-type="ref" data-reference="sec:results">5</a>. Decompose the surprise as $`\hat{s}_t = r_t + \eta_t`$, where $`r_t \equiv \Delta i_t - \mathbb{E}[\Delta i_t \mid \mathcal{B}_t]`$ is orthogonal to every $`\mathcal{B}_t`$-measurable random variable and $`\eta_t \equiv \mathbb{E}[\Delta i_t \mid \mathcal{B}_t] - m_{4t}`$ is the within-$`\mathcal{B}_t`$ extraction error, encompassing both classical Bayesian-update noise and any leakage bias $`\sum_j w_{jt}\,\ell_j\,c_{jt}`$. Then
``` math
\begin{equation}
\label{eq:beta_with_eta}
    \beta_s = 1 + \frac{\mathrm{Cov}(m_{4t},\, \eta_t)}{\mathrm{Var}(\hat{s}_t)}.
\end{equation}
```
Lower extraction precision does not by itself move $`\beta_s`$ away from unity if the reported mean remains a conditional expectation with respect to a coarser information set: refining or coarsening the $`\sigma`$-field $`\mathcal{I}_t`$ leaves the orthogonality property intact. Deviations from unity arise when extraction error or leakage induces covariance between the reported posterior mean and the within-$`\mathcal{B}_t`$ error term; purely additive output noise on the reported mean (which would not preserve the conditional-expectation interpretation) is not covered by the benchmark and would generally attenuate the slope. The empirical $`\hat\beta_s = 1.005`$ (Table <a href="#tab:surprise_measurement" data-reference-type="ref" data-reference="tab:surprise_measurement">[tab:surprise_measurement]</a>) thus jointly constrains the magnitude of $`\mathrm{Cov}(m_{4t}, \eta_t)`$: any leakage that pushes $`\beta_s`$ above unity must be offset by Bayesian-update error covarying in the opposite direction, an unlikely coincidence. ◻
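Proposition 1 can be checked by Monte Carlo. The sketch below draws from the Gaussian model with illustrative variances: the posterior mean is $`\mathcal{I}_t`$-measurable by construction, the posterior error and implementation noise are orthogonal to it, and both population slopes come back at unity:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
v4, su2 = 0.04, 0.01                     # posterior-error and implementation variances

m4 = rng.normal(0.0, 0.15, n)            # posterior mean m_4t (I_t-measurable)
theta = m4 + rng.normal(0, np.sqrt(v4), n)   # theta_t = m_4t + orthogonal posterior error
u = rng.normal(0, np.sqrt(su2), n)       # implementation noise, independent of G_t
di = theta + u                           # Delta i_t
s_hat = di - m4                          # narrative surprise

beta_m = np.cov(di, m4)[0, 1] / np.var(m4)
beta_s = np.cov(di, s_hat)[0, 1] / np.var(s_hat)
```

Both slopes equal one up to sampling error, as the orthogonality argument in the proof implies.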

## Span Test Against High-Frequency Surprises

The empirical span test in Section <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a> regresses the LLM surprise on the four Kuttner (2001)–Gürkaynak et al. (2005) announcement-window factors and asks whether the residual variance is significant. This subsection derives that test from primitives. Let $`\sigma(\mathcal{D}_t)`$ denote the $`\sigma`$-field generated by the public Federal Reserve documents available at the Beige Book release date (the LLM’s conditioning set at filtration $`\mathcal{P}_4`$), and let $`\mathcal{M}_{t^-}`$ denote the $`\sigma`$-field of the full information set available to financial market participants immediately before the FOMC announcement. By construction $`\sigma(\mathcal{D}_t) \subseteq \mathcal{M}_{t^-}`$: the documents are public, so any measurable function of them belongs to the larger market filtration, which also incorporates dealer information, intraday derivative pricing, and other observable signals.

The two surprise measures are:
``` math
\begin{equation}
\label{eq:two_surprises}
    s_t^{\text{HF}} \equiv \Delta i_t - \mathbb{E}[\Delta i_t \mid \mathcal{M}_{t^-}],
    \qquad
    s_t^{\text{LLM}} \equiv \Delta i_t - \mathbb{E}^{\text{LLM}}[\Delta i_t \mid \mathcal{D}_t].
\end{equation}
```
For the population analysis we write $`\mathbb{E}^{\text{LLM}}[\Delta i_t \mid \mathcal{D}_t] = \mathbb{E}[\Delta i_t \mid \mathcal{D}_t] + \eta_t`$, where $`\eta_t`$ is the extraction wedge characterised in Proposition <a href="#prop:efficiency" data-reference-type="ref" data-reference="prop:efficiency">1</a>. Conditional on a fixed model version, prompt template, and decoding configuration, the pipeline is a deterministic mapping from $`\mathcal{D}_t`$ to a posterior, so $`\eta_t`$ is $`\sigma(\mathcal{D}_t)`$-measurable; sampling-level stochasticity from non-zero decoding temperature is a second-order source of variance that we treat as residual noise.

By the law of iterated expectations applied to nested $`\sigma`$-fields, $`\mathbb{E}[\Delta i_t \mid \mathcal{D}_t] = \mathbb{E}[\,\mathbb{E}[\Delta i_t \mid \mathcal{M}_{t^-}] \mid \mathcal{D}_t\,]`$. Substituting and rearranging,
``` math
\begin{equation}
\label{eq:llm_decomp}
    s_t^{\text{LLM}} = \underbrace{\bigl(\Delta i_t - \mathbb{E}[\Delta i_t \mid \mathcal{M}_{t^-}]\bigr)}_{= \, s_t^{\text{HF}}} \;+\; \underbrace{\bigl(\mathbb{E}[\Delta i_t \mid \mathcal{M}_{t^-}] - \mathbb{E}[\Delta i_t \mid \mathcal{D}_t]\bigr)}_{\equiv \, \xi_t^{\text{doc}}} \;-\; \eta_t,
\end{equation}
```
which gives the population identity
``` math
\begin{equation}
\label{eq:span_identity}
    s_t^{\text{LLM}} = s_t^{\text{HF}} + \xi_t^{\text{doc}} - \eta_t.
\end{equation}
```
The wedge $`\xi_t^{\text{doc}}`$ is the update one would make on seeing the market’s information beyond the documents; it is $`\mathcal{M}_{t^-}`$-measurable by construction, and $`\mathbb{E}[\xi_t^{\text{doc}} \mid \mathcal{D}_t] = 0`$.[^34] The high-frequency surprise $`s_t^{\text{HF}}`$ is the innovation when passing from $`\mathcal{M}_{t^-}`$ to the realised $`\Delta i_t`$, hence orthogonal to any $`\mathcal{M}_{t^-}`$-measurable random variable. Three covariance properties follow immediately:

1.  $`\mathrm{Cov}(s_t^{\text{HF}}, \xi_t^{\text{doc}}) = 0`$, since $`\xi_t^{\text{doc}}`$ is $`\mathcal{M}_{t^-}`$-measurable;

2.  $`\mathrm{Cov}(s_t^{\text{HF}}, \eta_t) = 0`$, since $`\eta_t`$ is $`\sigma(\mathcal{D}_t)`$-measurable and hence $`\mathcal{M}_{t^-}`$-measurable;

3.  $`\mathrm{Cov}(\xi_t^{\text{doc}}, \eta_t) = 0`$, since $`\eta_t`$ is $`\sigma(\mathcal{D}_t)`$-measurable and $`\mathbb{E}[\xi_t^{\text{doc}} \mid \mathcal{D}_t] = 0`$.

Combining these with <a href="#eq:span_identity" data-reference-type="eqref" data-reference="eq:span_identity">[eq:span_identity]</a> yields the exact variance decomposition
``` math
\begin{equation}
\label{eq:var_decomp}
    \mathrm{Var}(s_t^{\text{LLM}}) = \mathrm{Var}(s_t^{\text{HF}}) + \mathrm{Var}(\xi_t^{\text{doc}}) + \mathrm{Var}(\eta_t).
\end{equation}
```
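A simulation makes the orthogonality argument concrete. In the Gaussian sketch below (all variances illustrative), the market observes the document signal plus one extra signal, the extraction wedge is a deterministic function of the documents, and the three-way variance decomposition holds up to sampling error:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
s_th, s_d, s_z, s_u = 1.0, 0.7, 0.5, 0.3
t0, td, tz = 1/s_th**2, 1/s_d**2, 1/s_z**2     # prior and signal precisions

theta = rng.normal(0, s_th, n)
d = theta + rng.normal(0, s_d, n)              # public document signal (D_t)
z = theta + rng.normal(0, s_z, n)              # market signal beyond the documents
di = theta + rng.normal(0, s_u, n)             # Delta i_t

m_doc = td * d / (t0 + td)                     # E[theta | D_t]
m_mkt = (td * d + tz * z) / (t0 + td + tz)     # E[theta | M_{t^-}]

s_hf = di - m_mkt                              # high-frequency surprise
xi = m_mkt - m_doc                             # document-vs-market wedge
eta = 0.05 * d                                 # sigma(D_t)-measurable extraction wedge
s_llm = s_hf + xi - eta                        # the span identity

lhs = np.var(s_llm)
rhs = np.var(s_hf) + np.var(xi) + np.var(eta)
```

The three pairwise covariances vanish in population (they are the properties 1–3 above), so `lhs` and `rhs` agree up to Monte Carlo noise.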

The empirical span test regresses $`s_t^{\text{LLM}}`$ on the four observable high-frequency factors $`\mathbf{f}_t = (\text{FF1}_t, \text{FF4}_t, \text{ED1}_t, \text{ED4}_t)^\top`$, an approximation of the market’s full information set. The residual $`u_t = s_t^{\text{LLM}} - \boldsymbol{\beta}^\top \mathbf{f}_t`$ contains three components rather than two:
``` math
\begin{equation}
\label{eq:u_decomp}
    u_t \;=\; \underbrace{\bigl(s_t^{\text{HF}} - \boldsymbol{\beta}^\top \mathbf{f}_t\bigr)}_{\text{HF-basis approximation error}} \;+\; \xi_t^{\text{doc}} \;-\; \eta_t.
\end{equation}
```
Treating $`\boldsymbol{\beta}^\top \mathbf{f}_t`$ as a noisy proxy for $`s_t^{\text{HF}}`$,
``` math
\begin{equation}
\label{eq:r2_interpretation}
    R^2 \;\lesssim\; \frac{\mathrm{Var}(s_t^{\text{HF}})}{\mathrm{Var}(s_t^{\text{HF}}) + \mathrm{Var}(\xi_t^{\text{doc}}) + \mathrm{Var}(\eta_t)},
\end{equation}
```
with the inequality reflecting basis approximation. The four-factor regression in Section <a href="#sec:irfs" data-reference-type="ref" data-reference="sec:irfs">6.1</a> reports $`R^2 = 0.185`$ on N = 217 meetings (1996–2024), implying that approximately $`81.5\%`$ of $`\mathrm{Var}(s_t^{\text{LLM}})`$ lies outside the linear span of (FF1, FF4, ED1, ED4). The two-factor GSS target/path decomposition in Appendix <a href="#subsec:gss_decomposition" data-reference-type="ref" data-reference="subsec:gss_decomposition">11.1.6</a> reports a smaller $`R^2 \approx 0.12`$ and isolates a different basis. The span test alone does not separate the three components in equation <a href="#eq:u_decomp" data-reference-type="eqref" data-reference="eq:u_decomp">[eq:u_decomp]</a>. Two auxiliary bounds discipline the attribution: the slope-unity evidence ($`\beta \approx 1`$ in Table <a href="#tab:surprise_measurement" data-reference-type="ref" data-reference="tab:surprise_measurement">[tab:surprise_measurement]</a>) rules out large components of $`\eta_t`$ that covary with $`m_{4t}`$ via <a href="#eq:beta_with_eta" data-reference-type="eqref" data-reference="eq:beta_with_eta">[eq:beta_with_eta]</a>, but does not bound the variance of an $`\eta_t`$ component orthogonal to $`m_{4t}`$; the GSS comparison in Appendix <a href="#subsec:gss_decomposition" data-reference-type="ref" data-reference="subsec:gss_decomposition">11.1.6</a> provides a separate check on the basis-approximation channel by projecting onto a different announcement-window basis. The bulk of the unexplained variance is therefore most naturally attributed to the document-vs-market wedge $`\xi_t^{\text{doc}}`$, conditional on these two bounds rather than on the span regression alone. The trading exercises that follow exploit this residual, interpreting it primarily as a document-vs-market gap.

## IV Coefficients Under Leakage

Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a> defines the reduced-form coefficient $`\theta_h^{\text{text}} = \mathrm{Cov}(y_{t+h}, \hat{s}_t) / \mathrm{Var}(\hat{s}_t)`$ and the IV coefficient $`\beta_h^{IV} = \mathrm{Cov}(y_{t+h}, \hat{s}_t) / \mathrm{Cov}(x_t, \hat{s}_t)`$. Let $`L_t \equiv \sum_{j=1}^{J} w_{jt}\,\ell_j\,c_{jt}`$ denote leakage entering the reported posterior mean, where $`w_{jt} = \tilde{\tau}_j / (\lambda_{0t} + \sum_k \tilde{\tau}_k)`$ are the posterior aggregation weights. The measured surprise decomposes as
``` math
\begin{equation*}
    \hat{s}_t = \hat{s}_t^{0} - L_t,
    \qquad
    \hat{s}_t^{0} \equiv u_t + \xi_t^{priv} + \xi_t^{pub} + \eta_t^{0},
\end{equation*}
```
where $`\hat{s}_t^{0}`$ is the leakage-free surprise and $`\eta_t^{0}`$ is the residual extraction error after stripping out $`L_t`$. The reduced-form coefficient becomes
``` math
\begin{equation}
\label{eq:theta_text_leakage}
    \theta_h^{\text{text}}
    =
    \frac{\mathrm{Cov}(y_{t+h},\,\hat{s}_t^{0}) - \mathrm{Cov}(y_{t+h},\,L_t)}
         {\mathrm{Var}(\hat{s}_t^{0}) + \mathrm{Var}(L_t) - 2\,\mathrm{Cov}(\hat{s}_t^{0},\,L_t)},
\end{equation}
```
and the IV coefficient with first-stage regressor $`x_t`$ becomes
``` math
\begin{equation}
\label{eq:beta_iv_leakage}
    \beta_h^{IV}
    =
    \frac{\mathrm{Cov}(y_{t+h},\,\hat{s}_t^{0}) - \mathrm{Cov}(y_{t+h},\,L_t)}
         {\mathrm{Cov}(x_t,\,\hat{s}_t^{0}) - \mathrm{Cov}(x_t,\,L_t)}.
\end{equation}
```
Leakage contaminates both numerator and denominator: an $`L_t`$ that correlates with $`y_{t+h}`$ (e.g., training data encoding future outcomes) biases the reduced-form numerator, and an $`L_t`$ that correlates with $`x_t`$ contaminates the first stage. Treating IV bias as the reduced-form contamination “rescaled by the first-stage coefficient” ignores the second channel and is correct only when $`\mathrm{Cov}(x_t, L_t) = 0`$. The look-ahead cutoff test (Table <a href="#tab:lookahead-bias-summary" data-reference-type="ref" data-reference="tab:lookahead-bias-summary">[tab:lookahead-bias-summary]</a>) bears on the empirical relevance of $`L_t`$ as a whole by comparing extraction behaviour inside and outside the LLM’s training window; it does not separate the two contamination channels.
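The two contamination channels can be illustrated by simulation (all coefficients hypothetical). Leakage enters both the outcome and the first-stage regressor; the sample IV coefficient coincides with the covariance decomposition in equation <a href="#eq:beta_iv_leakage" data-reference-type="eqref" data-reference="eq:beta_iv_leakage">[eq:beta_iv_leakage]</a> yet deviates from the true structural coefficient:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
s0 = rng.normal(0, 1, n)                             # leakage-free surprise
L = 0.3 * s0 + rng.normal(0, 0.5, n)                 # leakage component
x = 0.8 * s0 + 0.4 * L + rng.normal(0, 0.3, n)       # first-stage regressor, Cov(x, L) != 0
y = 2.0 * x + 1.5 * L + rng.normal(0, 1, n)          # outcome; true structural beta is 2.0
s_hat = s0 - L                                       # measured surprise

def cov(a, b):
    return np.cov(a, b)[0, 1]

beta_iv = cov(y, s_hat) / cov(x, s_hat)
# same object via the covariance decomposition of the leakage formula
beta_iv_formula = (cov(y, s0) - cov(y, L)) / (cov(x, s0) - cov(x, L))
```

With $`\mathrm{Cov}(x_t, L_t) \neq 0`$, the denominator contamination matters: `beta_iv` is biased away from the structural value even though the formula reproduces it exactly.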

## Omitted-Signal Predictability

#### A scalar benchmark.

Consider a single Gaussian signal $`z_t = \theta_t + \psi_t`$ with $`\psi_t \sim \mathcal{N}(0, \sigma_z^2)`$, observed by some agent but not by the pipeline, independent of the document signals and of $`u_t`$. Conditional on the $`\mathcal{P}_4`$ posterior, the surprise is $`\hat{s}_t = (\theta_t - m_{4t}) + u_t`$, the posterior error has variance $`v_{4t}`$, and $`\mathrm{Cov}(\hat{s}_t, z_t) = v_{4t}`$. The incremental $`R^2`$ from adding $`z_t`$ is
``` math
\begin{equation}
\label{eq:omitted_R2}
    \Delta R^2(z) = \frac{v_{4t}^2}{(v_{4t} + \sigma_u^2)(v_{4t} + \sigma_z^2)}.
\end{equation}
```
The expression is decreasing in $`\tilde{\tau}_j`$ (informative documents shrink $`v_{4t}`$, leaving less for $`z_t`$ to predict) and in $`\sigma_z^2`$ (noisier signals predict less).
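Equation <a href="#eq:omitted_R2" data-reference-type="eqref" data-reference="eq:omitted_R2">[eq:omitted_R2]</a> can be verified directly. With illustrative variances, the squared correlation between the surprise and the omitted signal (both taken conditional on $`\mathcal{P}_4`$, i.e. net of the posterior mean) matches the closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300_000
v4, su2, sz2 = 0.05, 0.02, 0.08          # posterior, implementation, omitted-signal variances

post_err = rng.normal(0, np.sqrt(v4), n)         # theta_t - m_4t
s_hat = post_err + rng.normal(0, np.sqrt(su2), n)    # surprise given P_4
z_c = post_err + rng.normal(0, np.sqrt(sz2), n)      # omitted signal, centred on m_4t

r2_sim = np.corrcoef(s_hat, z_c)[0, 1] ** 2
r2_formula = v4**2 / ((v4 + su2) * (v4 + sz2))       # eq. (omitted_R2)
```

Shrinking `v4` (more informative documents) or inflating `sz2` (a noisier omitted signal) lowers both quantities in lockstep.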

The two empirical objects below share the same scalar logic — a signal omitted by the documentary pipeline that recovers explanatory power conditional on $`\mathcal{P}_4`$ — but differ in the structural meaning of the omitted information set. The scalar formula is therefore a benchmark for the order of magnitude one expects, not an exact common structural model.

#### Public non-document wedge (vector analogue).

The Bauer & Swanson (2023) predictors are a vector of public observables $`Z_t \in \mathcal{M}_t \setminus \mathcal{B}_t`$ rather than a single Gaussian signal. The empirical $`R^2 = 0.166`$ from Section <a href="#sec:predictability" data-reference-type="ref" data-reference="sec:predictability">5.2</a> provides evidence of a remaining public-information wedge of the qualitative character that <a href="#eq:omitted_R2" data-reference-type="eqref" data-reference="eq:omitted_R2">[eq:omitted_R2]</a> predicts. Persistent components of $`Z_t`$ may be partially absorbed by the cross-meeting prior through $`m_{4,t-1}`$, so the wedge is a lower bound on the gap $`\mathcal{M}_t \setminus \mathcal{B}_t`$.

#### Private-information wedge (Greenbook, scalar analogue).

The Fed’s internal Greenbook forecast is closer to a scalar private signal $`g_t = \theta_t + \phi_t`$ with $`\phi_t \sim \mathcal{N}(0, \sigma_G^2)`$, with $`g_t \in \mathcal{G}_t \setminus \mathcal{M}_t`$. Equation <a href="#eq:omitted_R2" data-reference-type="eqref" data-reference="eq:omitted_R2">[eq:omitted_R2]</a> applies with $`\sigma_z^2 \to \sigma_G^2`$. The empirical increment of 11.2 percentage points (Table <a href="#tab:surprise_power" data-reference-type="ref" data-reference="tab:surprise_power">[tab:surprise_power]</a>) constrains $`\sigma_G^2`$ given $`v_{4t}`$ and $`\sigma_u^2`$. This is a calibration target rather than a derived prediction; the model’s prediction is that any private signal with non-zero precision strictly improves on the document-based measure.

## Setup, Recursion, and Comparative Statics

The body uses a Normal–Normal conjugate framework for tractability. The latent policy state $`\theta_t \equiv \mathbb{E}[\Delta i_t \mid \mathcal{G}_t]`$ evolves as $`\theta_t = \rho \theta_{t-1} + \omega_t`$ with $`\omega_t \sim \mathcal{N}(0, \sigma_\omega^2)`$, and $`\Delta i_t = \theta_t + u_t`$ with $`u_t \sim \mathcal{N}(0, \sigma_u^2)`$ independent of $`\mathcal{G}_t`$. Standard Kalman algebra gives the cross-meeting predictive prior $`\theta_t \mid \mathcal{P}_{4,t-1} \sim \mathcal{N}(\rho m_{4,t-1},\, \rho^2 v_{4,t-1} + \sigma_\omega^2)`$ with prior precision $`\lambda_{0t} = (\rho^2 v_{4,t-1} + \sigma_\omega^2)^{-1}`$, and the recursive within-meeting update upon observing the $`j`$th extracted signal with effective precision $`\tilde{\tau}_j`$:
``` math
\begin{equation*}
    m_{Jt} = m_{J-1,t} + K_{Jt}(\tilde{d}_{Jt} - m_{J-1,t}),
    \qquad K_{Jt} = \frac{\tilde{\tau}_J}{\Lambda_{J-1} + \tilde{\tau}_J},
    \qquad v_{Jt}^{-1} = \Lambda_{J-1} + \tilde{\tau}_J,
\end{equation*}
```
where $`\Lambda_{J-1} = \lambda_{0t} + \sum_{j<J} \tilde{\tau}_j`$ is cumulative precision before document $`J`$. The effective precision $`\tilde{\tau}_j`$ in equation <a href="#eq:effective_precision" data-reference-type="eqref" data-reference="eq:effective_precision">[eq:effective_precision]</a> composes raw document precision $`\tau_j`$ with Decoder and Forecaster precisions $`(\kappa_j^{dec}, \kappa_j^{for})`$. Writing the Decoder’s structured output as $`\theta_t + \varepsilon_{jt} + \varepsilon_{jt}^{dec}`$ and the Forecaster’s processed signal as $`\theta_t + \varepsilon_{jt} + \varepsilon_{jt}^{dec} + \nu_{jt}^{for}`$, with the three noise terms mutually independent, additivity of variances gives $`\mathrm{Var}(\varepsilon_{jt} + \varepsilon_{jt}^{dec} + \nu_{jt}^{for}) = \tau_j^{-1} + (\kappa_j^{dec})^{-1} + (\kappa_j^{for})^{-1}`$, and equation <a href="#eq:effective_precision" data-reference-type="eqref" data-reference="eq:effective_precision">[eq:effective_precision]</a> follows by inversion. The leakage term $`\ell_j c_{jt}`$ is kept separate as a non-classical bias channel.

#### Sequential learning.

If $`\tilde{\tau}_j > 0`$ for every $`j`$, the differential entropy of $`\mathcal{P}_J`$ falls monotonically: $`H(\mathcal{P}_J) - H(\mathcal{P}_{J-1}) = -\tfrac{1}{2}\log(1 + \tilde{\tau}_J / \Lambda_{J-1}) < 0`$. Each document contributes more when the prior is diffuse and less as beliefs tighten. The distributional analysis in Appendix <a href="#subsec:distributional_appendix" data-reference-type="ref" data-reference="subsec:distributional_appendix">10.1.2</a> confirms monotone entropy decline in the data.

#### Transparency and degradation.

The surprise variance is $`\mathrm{Var}(\hat{s}_t \mid R_t) = (\lambda_{0t} + \sum_j \tilde{\tau}_j(R_t))^{-1} + \sigma_u^2`$. Higher transparency raises $`\tau_j`$, higher LLM quality raises $`\kappa_j^{dec}`$ and $`\kappa_j^{for}`$, and either route lowers surprise variance through the same composite precision channel. The Consensus Economics validation across the 2011–2018 partial-transparency regime and the post-2005 fast-minutes era (Appendix <a href="#appendix:ce-llm-validation" data-reference-type="ref" data-reference="appendix:ce-llm-validation">9.2.3</a>) provides the empirical counterpart.
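The comparative static is mechanical and easy to verify: raising either raw document precision (transparency) or LLM precision lowers the surprise variance through the same composite channel. A minimal sketch, with all parameter values illustrative:

```python
def surprise_var(lam0, taus, kappa, su2):
    """Var(s_hat | R) = (lam0 + sum of effective precisions)^{-1} + sigma_u^2,
    with Decoder and Forecaster sharing a common precision kappa (assumption)."""
    eff = [1.0 / (1.0 / t + 2.0 / kappa) for t in taus]
    return 1.0 / (lam0 + sum(eff)) + su2

base = surprise_var(1.0, [2.0] * 4, kappa=20.0, su2=0.01)
more_transparent = surprise_var(1.0, [4.0] * 4, kappa=20.0, su2=0.01)   # higher tau_j
better_llm = surprise_var(1.0, [2.0] * 4, kappa=80.0, su2=0.01)         # higher kappa
```

Either route strictly lowers the surprise variance relative to the baseline, which is the comparative static the Consensus Economics validation takes to the data.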

# References

<div id="refs" class="references csl-bib-body hanging-indent" entry-spacing="0" line-spacing="2">

<div id="ref-Acosta2023" class="csl-entry">

Acosta, M. (2023). A New Measure of Central Bank Transparency and Implications for the Effectiveness of Monetary Policy. *International Journal of Central Banking*, *19*(3), 49–97. <https://www.ijcb.org/journal/v19n3/new-measure-central-bank-transparency-and-implications-effectiveness-monetary-policy>

</div>

<div id="ref-AdrianCrumpMoench2013" class="csl-entry">

Adrian, T., Crump, R. K., & Moench, E. (2013). Pricing the term structure with linear regressions. *Journal of Financial Economics*, *110*(1), 110–138. <https://doi.org/10.1016/j.jfineco.2013.04.009>

</div>

<div id="ref-AhrensErdemliogluMcMahonNeelyYang2024" class="csl-entry">

Ahrens, M., Erdemlioglu, D., McMahon, M., Neely, C. J., & Yang, X. (2024). Mind your language: Market responses to central bank speeches. *Journal of Econometrics*, 105921. <https://doi.org/10.1016/j.jeconom.2024.105921>

</div>

<div id="ref-AhrensMcMahon2021" class="csl-entry">

Ahrens, M., & McMahon, M. (2021). Extracting economic signals from central bank speeches. *Proceedings of the Third Workshop on Economics and Natural Language Processing*. <https://aclanthology.org/2021.econlp-1.12/>

</div>

<div id="ref-Aksit2020" class="csl-entry">

Aksit, D. (2020). Unconventional Monetary Policy Surprises: Delphic or Odyssean? *Available at SSRN 3602291*.

</div>

<div id="ref-andersson2006monetary" class="csl-entry">

Andersson, M., Dillén, H., & Sellin, P. (2006). Monetary policy signaling and movements in the term structure of interest rates. *Journal of Monetary Economics*, *53*(8), 1815–1855.

</div>

<div id="ref-AndradeFerroni2021" class="csl-entry">

Andrade, P., & Ferroni, F. (2021). Delphic and odyssean monetary policy shocks: Evidence from the euro area. *Journal of Monetary Economics*, *117*, 816–832.

</div>

<div id="ref-ApelGrimaldi2014" class="csl-entry">

Apel, M., & Blix Grimaldi, M. (2014). How Informative Are Central Bank Minutes? *Review of Economics*, *65*(1), 53–76. <https://doi.org/10.1515/roe-2014-0104>

</div>

<div id="ref-AruobaDrechsel2024" class="csl-entry">

Aruoba, S. B., & Drechsel, T. (2024). *Identifying Monetary Policy Shocks: A Natural Language Approach*. National Bureau of Economic Research. <https://doi.org/10.3386/w32417>

</div>

<div id="ref-BauerSwanson2023b" class="csl-entry">

Bauer, M. D., & Swanson, E. T. (2023a). A reassessment of monetary policy surprises and high-frequency identification. *NBER Macroeconomics Annual*, *37*(1), 87–155. <https://doi.org/10.1086/723574>

</div>

<div id="ref-BauerSwanson2023a" class="csl-entry">

Bauer, M. D., & Swanson, E. T. (2023b). An Alternative Explanation for the “Fed Information Effect". *American Economic Review*, *113*(3), 664–700. <https://doi.org/10.1257/aer.20201220>

</div>

<div id="ref-Bernanke2005" class="csl-entry">

Bernanke, B. S. (2005). The logic of monetary policy. *Vital Speeches of the Day*, *71*(6), 165.

</div>

<div id="ref-BernankeReinhartSack2004" class="csl-entry">

Bernanke, B. S., Reinhart, V. R., & Sack, B. P. (2004). *Monetary Policy Alternatives at the Zero Bound: An Empirical Assessment*. Brookings Institution. <https://www.brookings.edu/wp-content/uploads/2004/01/20040105.pdf>

</div>

<div id="ref-Blinder2008" class="csl-entry">

Blinder, A. S., Ehrmann, M., Fratzscher, M., De Haan, J., & Jansen, D.-J. (2008). Central Bank Communication and Monetary Policy: A Survey of Theory and Evidence. *Journal of Economic Literature*, *46*(4), 910–945.

</div>

<div id="ref-BordaloGennaioliMaShleifer2020" class="csl-entry">

Bordalo, P., Gennaioli, N., Ma, Y., & Shleifer, A. (2020). Overreaction in macroeconomic expectations. *American Economic Review*, *110*(9), 2748–2782.

</div>

<div id="ref-BugelHidalgoLuetticke2026" class="csl-entry">

Bügel, D., Hidalgo, A., & Luetticke, R. (2026). *Unconventional but different after all? A unified series of narrative monetary policy shocks*. <https://www.ralphluetticke.com/>

</div>

<div id="ref-Bybee2023" class="csl-entry">

Bybee, J. L. (2023). The Ghost in the Machine: Generating Beliefs with Large Language Models. *Working Paper, Yale School of Management*.

</div>

<div id="ref-Bybee2023survey" class="csl-entry">

Bybee, L. (2023). Surveying Generative AI’s Economic Expectations. *arXiv Preprint arXiv:2305.02823*.

</div>

<div id="ref-CaballeroSimsek2022" class="csl-entry">

Caballero, R. J., & Simsek, A. (2022). Monetary Policy with Opinionated Markets. *American Economic Review*, *112*(7), 2353–2392. <https://doi.org/10.1257/aer.20210271>

</div>

<div id="ref-Campbell2012" class="csl-entry">

Campbell, J. R., Evans, C. L., Fisher, J. D., Justiniano, A., Calomiris, C. W., & Woodford, M. (2012). Macroeconomic effects of federal reserve forward guidance \[with comments and discussion\]. *Brookings Papers on Economic Activity*, 1–80. <https://doi.org/10.1353/eca.2012.0004>

</div>

<div id="ref-CampbellShiller1988" class="csl-entry">

Campbell, J. Y., & Shiller, R. J. (1988). The dividend-price ratio and expectations of future dividends and discount factors. *The Review of Financial Studies*, *1*(3), 195–228.

</div>

<div id="ref-ChristianoEichenbaumEvans1999" class="csl-entry">

Christiano, L. J., Eichenbaum, M., & Evans, C. L. (1999). Monetary policy shocks: What have we learned and to what end? *Handbook of Macroeconomics*, *1*, 65–148. <https://doi.org/10.1016/S1574-0048(99)01005-8>

</div>

<div id="ref-Cieslak2018" class="csl-entry">

Cieslak, A. (2018). Short-rate expectations and unexpected returns in treasury bonds. *The Review of Financial Studies*, *31*(9), 3265–3306.

</div>

<div id="ref-Cieslak2024" class="csl-entry">

Cieslak, A., McMahon, M., & Pang, H. (2024). *Did I Make Myself Clear? The Fed and the Market under the 2020 Monetary Policy Framework* (CEPR Discussion Paper No. 19360). Centre for Economic Policy Research.

</div>

<div id="ref-Cieslak2019" class="csl-entry">

Cieslak, A., & Schrimpf, A. (2019). Non-monetary news in central bank communication. *Journal of International Economics*, *118*, 293–315.

</div>

<div id="ref-Cieslak2021" class="csl-entry">

Cieslak, A., & Vissing-Jorgensen, A. (2021). The economics of the Fed put. *The Review of Financial Studies*, *34*(9), 4045–4089.

</div>

<div id="ref-Clarida1999" class="csl-entry">

Clarida, R., Gali, J., & Gertler, M. (1999). The Science of Monetary Policy: A New Keynesian Perspective. *Journal of Economic Literature*, *37*(4), 1661–1707. <https://doi.org/10.1257/jel.37.4.1661>

</div>

<div id="ref-CloyneJordaTaylor2020" class="csl-entry">

Cloyne, J. S., Jorda, Ò., & Taylor, A. M. (2020). *Decomposing the fiscal multiplier* (NBER Working Paper No. 26939). National Bureau of Economic Research. <https://www.nber.org/papers/w26939>

</div>

<div id="ref-Cochrane2011" class="csl-entry">

Cochrane, J. H. (2011). Presidential address: Discount rates. *The Journal of Finance*, *66*(4), 1047–1108. <https://doi.org/10.1111/j.1540-6261.2011.01671.x>

</div>

<div id="ref-Cochrane2025" class="csl-entry">

Cochrane, J. H. (2025). *Inflation dynamics with a generalized lucas phillips curve* \[Working Paper\]. Hoover Institution; National Bureau of Economic Research.

</div>

<div id="ref-FioreMaurinMijakovicSandri2024" class="csl-entry">

De Fiore, F., Maurin, A., Mijakovic, A., & Sandri, D. (2024). *Monetary policy in the news: Communication pass-through and inflation expectations*. Bank for International Settlements, Monetary; Economic Department. <https://www.bis.org/publ/work1231.htm>

</div>

<div id="ref-DieboldMariano1995" class="csl-entry">

Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. *Journal of Business & Economic Statistics*, *13*(3), 253–263. <https://doi.org/10.1080/07350015.1995.10524599>

</div>

<div id="ref-Du2024" class="csl-entry">

Du, Z., Zeng, A., Dong, Y., & Tang, J. (2024). Understanding emergent abilities of language models from the loss perspective. *arXiv Preprint arXiv:2403.15796*.

</div>

<div id="ref-FaveroFernandez-Fuertes2025" class="csl-entry">

Favero, C. A., & Fernández-Fuertes, R. (2025). <span class="nocase">Towards Data-Congruent Models of the Term Structure of Interest Rates</span>. *Econometric Reviews*, 1–23. <https://doi.org/10.1080/07474938.2025.2458223>

</div>

<div id="ref-Feng2025" class="csl-entry">

<span class="nocase">Feng, S., Ding, W., Liu, A., Wang, Z., Shi, W., Wang, Y., Shen, Z., Han, X., Lang, H., Lee, C.-Y., et al.</span> (2025). When One LLM Drools, Multi-LLM Collaboration Rules. *arXiv Preprint arXiv:2502.04506*.

</div>

<div id="ref-FengTrinh2026" class="csl-entry">

<span class="nocase">Feng, T., Trinh, T. H., Bingham, G., Hwang, D., Chervonyi, Y., Jung, J., Lee, J., Pagano, C., Kim, S., Pasqualotto, F., et al.</span> (2026). Towards Autonomous Mathematics Research. *arXiv Preprint arXiv:2602.10177*.

</div>

<div id="ref-FlemingMizrachNguyen2018" class="csl-entry">

Fleming, M. J., Mizrach, B., & Nguyen, G. (2018). The Microstructure of a US Treasury ECN: The BrokerTec platform. *Journal of Financial Markets*, *40*, 2–22.

</div>

<div id="ref-Fujiwara2023" class="csl-entry">

Fujiwara, M., Suimon, Y., & Nakagawa, K. (2023). Treasury yield spread prediction with sentiments of Beige Book and macroeconomic data. *2023 14th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI)*, 337–342.

</div>

<div id="ref-Gambacorta2024" class="csl-entry">

Gambacorta, L., Kwon, B., Park, T., Patelli, P., & Zhu, S. (2024). *CB-LMs: Language Models for Central Banking*. Bank for International Settlements, Monetary and Economic Department. <https://www.bis.org/publ/work1215.htm>

</div>

<div id="ref-GertlerKaradi2015" class="csl-entry">

Gertler, M., & Karadi, P. (2015). Monetary Policy Surprises, Credit Costs, and Economic Activity. *American Economic Journal: Macroeconomics*, *7*(1), 44–76. <https://doi.org/10.1257/mac.20130329>

</div>

<div id="ref-GlassermanLin2024" class="csl-entry">

Glasserman, P., & Lin, C. (2024). Assessing Look-Ahead Bias in Stock Return Predictions Generated by GPT Sentiment Analysis. *The Journal of Financial Data Science*, *6*(1), 25–42. <https://arxiv.org/abs/2309.17322>

</div>

<div id="ref-FederalReserve2025" class="csl-entry">

Board of Governors of the Federal Reserve System. (2025). *The Beige Book: Summary of Commentary on Current Economic Conditions by Federal Reserve District, February 2025* \[Beige Book\]. Board of Governors of the Federal Reserve System. <https://www.federalreserve.gov/monetarypolicy/files/BeigeBook_20250305.pdf>

</div>

<div id="ref-Gu2024" class="csl-entry">

Gu, J., Pang, L., Shen, H., & Cheng, X. (2024). Do LLMs play dice? Exploring probability distribution sampling in large language models for behavioral simulation. *arXiv Preprint arXiv:2404.09043*. <https://arxiv.org/abs/2404.09043>

</div>

<div id="ref-Guo2024" class="csl-entry">

Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., & Zhang, X. (2024). Large Language Model based Multi-Agents: A Survey of Progress and Challenges. *International Joint Conference on Artificial Intelligence (IJCAI)*. <https://dl.acm.org/doi/10.24963/ijcai.2024/890>

</div>

<div id="ref-GurkaynakSackSwanson2005" class="csl-entry">

Gürkaynak, R. S., Sack, B., & Swanson, E. (2005). The Sensitivity of Long-Term Interest Rates to Economic News: Evidence and Implications for Macroeconomic Models. *American Economic Review*, *95*(1), 425–436. <https://doi.org/10.1257/0002828053828443>

</div>

<div id="ref-HackIstrefiMeier2025" class="csl-entry">

Hack, L., Istrefi, K., & Meier, M. (2024). *The Systematic Origins of Monetary Policy Shocks* \[Unpublished manuscript\].

</div>

<div id="ref-HansenKazinnik2023" class="csl-entry">

Hansen, A. L., & Kazinnik, S. (2023). Can ChatGPT Decipher Fedspeak? *Available at SSRN*. <https://doi.org/10.2139/ssrn.4399406>

</div>

<div id="ref-HansenMcMahon2016" class="csl-entry">

Hansen, S., & McMahon, M. (2016). Shocking language: Understanding the macroeconomic effects of central bank communication. *Journal of International Economics*, *99*, S114–S133.

</div>

<div id="ref-HansonStein2012" class="csl-entry">

Hanson, S. G., & Stein, J. C. (2012). Monetary Policy and Long-Term Real Rates. *Finance and Economics Discussion Series*, (2012-46). <https://doi.org/10.17016/FEDS.2012.46>

</div>

<div id="ref-Holm1979" class="csl-entry">

Holm, S. (1979). A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, *6*(2), 65–70.

</div>

<div id="ref-Jarocinski2024" class="csl-entry">

Jarociński, M. (2024). Estimating the Fed’s unconventional policy shocks. *Journal of Monetary Economics*, *144*, 103548.

</div>

<div id="ref-JarocinskiKaradi2020" class="csl-entry">

Jarociński, M., & Karadi, P. (2020). Deconstructing Monetary Policy Surprises—The Role of Information Shocks. *American Economic Journal: Macroeconomics*, *12*(2), 1–43. <https://doi.org/10.1257/mac.20180082>

</div>

<div id="ref-JarocinskiKaradi2025" class="csl-entry">

Jarociński, M., & Karadi, P. (2025). *Disentangling Monetary Policy, Central Bank Information, and Fed Response to News Shocks* (CEPR Discussion Paper No. 19923). Centre for Economic Policy Research.

</div>

<div id="ref-Jiang2023" class="csl-entry">

Jiang, H. (2023). A Latent Space Theory for Emergent Abilities in Large Language Models. *arXiv Preprint arXiv:2304.09960*.

</div>

<div id="ref-Jorda2005" class="csl-entry">

Jordà, Ò. (2005). Estimation and inference of impulse responses by local projections. *American Economic Review*, *95*(1), 161–182. <https://doi.org/10.1257/0002828053828518>

</div>

<div id="ref-Jorda2025" class="csl-entry">

Jordà, Ò., & Taylor, A. M. (2025). Local projections. *Journal of Economic Literature*, *63*(1), 59–110. <https://doi.org/10.1257/jel.20241521>

</div>

<div id="ref-Kim2024" class="csl-entry">

Kim, A., Muhn, M., & Nikolaev, V. (2024). Financial statement analysis with large language models. *arXiv Preprint arXiv:2407.17866*.

</div>

<div id="ref-kojima2022large" class="csl-entry">

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). *Large Language Models are Zero-Shot Reasoners*. <https://arxiv.org/abs/2205.11916>

</div>

<div id="ref-Kuttner2001" class="csl-entry">

Kuttner, K. N. (2001). Monetary Policy Surprises and Interest Rates: Evidence from the Fed Funds Futures Market. *Journal of Monetary Economics*, *47*(3), 523–544. <https://doi.org/10.1016/S0304-3932(01)00055-1>

</div>

<div id="ref-Leeper1997" class="csl-entry">

Leeper, E. M. (1997). Narrative and VAR Approaches to Monetary Policy: Common Identification Problems. *Journal of Monetary Economics*, *40*(3), 641–657. <https://doi.org/10.1016/S0304-3932(97)00051-2>

</div>

<div id="ref-Li2024" class="csl-entry">

Li, J., Zhang, Q., Yu, Y., Fu, Q., & Ye, D. (2024). More Agents Is All You Need. *arXiv Preprint arXiv:2402.05120*. <https://arxiv.org/abs/2402.05120>

</div>

<div id="ref-LopezLira2025" class="csl-entry">

Lopez-Lira, A. (2025). *Can Large Language Models trade? Testing financial theories with LLM Agents in Market Simulations* \[Working Paper\]. University of Florida.

</div>

<div id="ref-LopezTang2023" class="csl-entry">

Lopez-Lira, A., & Tang, Y. (2023). Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models. *arXiv Preprint arXiv:2304.07619*.

</div>

<div id="ref-Lucca2015" class="csl-entry">

Lucca, D. O., & Moench, E. (2015). The Pre-FOMC Announcement Drift. *The Journal of Finance*, *70*(1), 329–371. <https://doi.org/10.1111/jofi.12196>

</div>

<div id="ref-MertensRavn2013" class="csl-entry">

Mertens, K., & Ravn, M. O. (2013). The Dynamic Effects of Personal and Corporate Income Tax Changes in the United States. *American Economic Review*, *103*(4), 1212–1247. <https://doi.org/10.1257/aer.103.4.1212>

</div>

<div id="ref-MincerZarnowitz1969" class="csl-entry">

Mincer, J. A., & Zarnowitz, V. (1969). The Evaluation of Economic Forecasts. In J. A. Mincer (Ed.), *Economic forecasts and expectations: Analysis of forecasting behavior and performance* (pp. 3–46). National Bureau of Economic Research. <https://www.nber.org/system/files/chapters/c1214/c1214.pdf>

</div>

<div id="ref-MirandaRicco2021" class="csl-entry">

Miranda-Agrippino, S., & Ricco, G. (2021). The transmission of monetary policy shocks. *American Economic Journal: Macroeconomics*, *13*(3), 74–107. <https://doi.org/10.1257/mac.20180124>

</div>

<div id="ref-MontielOleaPflueger2013" class="csl-entry">

Montiel Olea, J. L., & Pflueger, C. (2013). A robust test for weak instruments. *Journal of Business & Economic Statistics*, *31*(3), 358–369. <https://doi.org/10.1080/00401706.2013.806694>

</div>

<div id="ref-MontielOleaPlagborgMoller2021" class="csl-entry">

Montiel Olea, J. L., & Plagborg-Møller, M. (2021). Local projection inference is simpler and more robust than you think. *Econometrica*, *89*(4), 1789–1823. <https://doi.org/10.3982/ECTA18756>

</div>

<div id="ref-NakamuraSteinsson2018" class="csl-entry">

Nakamura, E., & Steinsson, J. (2018). High-Frequency Identification of Monetary Non-Neutrality: The Information Effect. *The Quarterly Journal of Economics*, *133*(3), 1283–1330. <https://doi.org/10.1093/qje/qjy004>

</div>

<div id="ref-NogueriAlonso2024" class="csl-entry">

Noguer i Alonso, M. (2024). *Look-ahead Bias in Large Language Models (LLMs): Implications and Applications in Finance* \[Working Paper\]. Artificial Intelligence in Finance Institute.

</div>

<div id="ref-Peskoff2024" class="csl-entry">

Peskoff, D., Visokay, A., Schulhoff, S., Wachspress, B., Blinder, A., & Stewart, B. M. (2024). GPT deciphering fedspeak: Quantifying dissent among hawks and doves. *arXiv Preprint arXiv:2407.19110*.

</div>

<div id="ref-PlagborgMollerWolf2021" class="csl-entry">

Plagborg-Møller, M., & Wolf, C. K. (2021). Local projections and VARs estimate the same impulse responses. *Econometrica*, *89*(2), 955–980. <https://doi.org/10.3982/ECTA17813>

</div>

<div id="ref-PolitisRomano1994" class="csl-entry">

Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. *Journal of the American Statistical Association*, *89*(428), 1303–1313. <https://doi.org/10.1080/01621459.1994.10476870>

</div>

<div id="ref-Poole2001" class="csl-entry">

Poole, W. (2001). Expectations. *Federal Reserve Bank of St. Louis Review*, *83*(March/April 2001).

</div>

<div id="ref-Ramey2016" class="csl-entry">

Ramey, V. A. (2016). Macroeconomic Shocks and their Propagation. *Handbook of Macroeconomics*, *2*, 71–162. <https://doi.org/10.1016/bs.hesmac.2016.03.003>

</div>

<div id="ref-RiccoSavini2025" class="csl-entry">

Ricco, G., & Savini, E. (2025). *Decomposing Monetary Policy Surprises: Shock, Information, and Policy Rule Revision* (CEPR Discussion Paper No. 20166). Centre for Economic Policy Research. <https://cepr.org/publications/dp20166>

</div>

<div id="ref-RomerRomer2004" class="csl-entry">

Romer, C. D., & Romer, D. H. (2004). <span class="nocase">A New Measure of Monetary Shocks: Derivation and Implications</span>. *American Economic Review*, *94*(4), 1055–1084. <https://doi.org/10.1257/0002828042002651>

</div>

<div id="ref-RomerRomer1989" class="csl-entry">

Romer, C. D., & Romer, D. H. (1989). Does Monetary Policy Matter? A New Test in the Spirit of Friedman and Schwartz. *NBER Macroeconomics Annual*, *4*, 121–184. <https://doi.org/10.1086/654119>

</div>

<div id="ref-RomerRomer2000" class="csl-entry">

Romer, C. D., & Romer, D. H. (2000). Federal Reserve information and the behavior of interest rates. *American Economic Review*, *90*(3), 429–457.

</div>

<div id="ref-Rudebusch2002" class="csl-entry">

Rudebusch, G. D. (2002). Term structure evidence on interest rate smoothing and monetary policy inertia. *Journal of Monetary Economics*, *49*(6), 1161–1187. <https://doi.org/10.1016/S0304-3932(02)00149-6>

</div>

<div id="ref-SarkarVafa2024" class="csl-entry">

Sarkar, S. K., & Vafa, K. (2024). *Lookahead Bias in Pretrained Language Models* \[Working Paper\]. Harvard University. <https://doi.org/10.2139/ssrn.4754678>

</div>

<div id="ref-SchoeneggerEtAl2025" class="csl-entry">

Schoenegger, P., Park, P. S., Karger, E., Trott, S., & Tetlock, P. E. (2025). AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy. *ACM Transactions on Interactive Intelligent Systems*, *15*(1), 1–25. <https://doi.org/10.1145/3707649>

</div>

<div id="ref-ShapiroSudhofWilson2022" class="csl-entry">

Shapiro, A. H., Sudhof, M., & Wilson, D. J. (2022). Measuring News Sentiment. *Journal of Econometrics*, *228*(2), 221–243.

</div>

<div id="ref-ShiHollifield2024" class="csl-entry">

Shi, J., & Hollifield, B. (2024). Predictive Power of LLMs in Financial Markets. *arXiv Preprint arXiv:2411.16569*.

</div>

<div id="ref-Sims1980" class="csl-entry">

Sims, C. A. (1980). Macroeconomics and Reality. *Econometrica*, *48*(1), 1–48. <https://doi.org/10.2307/1912017>

</div>

<div id="ref-Sims1992" class="csl-entry">

Sims, C. A. (1992). Interpreting the Macroeconomic Time Series Facts: The Effects of Monetary Policy. *European Economic Review*, *36*(5), 975–1000. <https://doi.org/10.1016/0014-2921(92)90041-T>

</div>

<div id="ref-SreedharChilton2024" class="csl-entry">

Sreedhar, K., & Chilton, L. (2024). Simulating human strategic behavior: Comparing single and multi-agent LLMs. *arXiv Preprint arXiv:2402.08189*.

</div>

<div id="ref-StockWatson2012" class="csl-entry">

Stock, J. H., & Watson, M. W. (2012). Disentangling the Channels of the 2007-09 Recession. *Brookings Papers on Economic Activity*, *43*(1), 81–156. <https://doi.org/10.1353/eca.2012.0005>

</div>

<div id="ref-StockWatson2001" class="csl-entry">

Stock, J. H., & Watson, M. W. (2001). Vector autoregressions. *Journal of Economic Perspectives*, *15*(4), 101–115. <https://doi.org/10.1257/jep.15.4.101>

</div>

<div id="ref-StockWatson2018" class="csl-entry">

Stock, J. H., & Watson, M. W. (2018). Identification and estimation of dynamic causal effects in macroeconomics using external instruments. *The Economic Journal*, *128*(610), 917–948. <https://doi.org/10.1111/ecoj.12593>

</div>

<div id="ref-Svensson2003" class="csl-entry">

Svensson, L. E. O. (2003). What is wrong with Taylor rules? Using judgment in monetary policy through targeting rules. *Journal of Economic Literature*, *41*(2), 426–477.

</div>

<div id="ref-SvenssonWoodford2003" class="csl-entry">

Svensson, L. E., & Woodford, M. (2003). Indicator variables for optimal policy. *Journal of Monetary Economics*, *50*(3), 691–720.

</div>

<div id="ref-SwansonWilliams2014" class="csl-entry">

Swanson, E. T., & Williams, J. C. (2014). Measuring the Effect of the Zero Lower Bound on Medium- and Longer-Term Interest Rates. *American Economic Review*, *104*(10), 3154–3185. <https://doi.org/10.1257/aer.104.10.3154>

</div>

<div id="ref-TalebiradNadiri2023" class="csl-entry">

Talebirad, Y., & Nadiri, A. (2023). Multi-agent collaboration: Harnessing the power of intelligent llm agents. *arXiv Preprint arXiv:2306.03314*.

</div>

<div id="ref-Taylor1993" class="csl-entry">

Taylor, J. B. (1993). Discretion versus Policy Rules in Practice. *Carnegie-Rochester Conference Series on Public Policy*, *39*, 195–214.

</div>

<div id="ref-Amazon2025" class="csl-entry">

Amazon Science Team. (2025). MaRGen: Multi-agent LLM approach for self-directed market research and analysis. *LLM4ECommerce Workshop, KDD 2025*. <https://www.amazon.science/publications/margen-multi-agent-llm-approach-for-self-directed-market-research-and-analysis>

</div>

<div id="ref-Tillmann2025" class="csl-entry">

Tillmann, A. (2025). Literature Review of Multi-Agent Debate for Problem-Solving. *arXiv Preprint arXiv:2506.00066*. <https://arxiv.org/abs/2506.00066>

</div>

<div id="ref-Vaswani2017" class="csl-entry">

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is All You Need. *Advances in Neural Information Processing Systems*, *30*.

</div>

<div id="ref-Villota2024" class="csl-entry">

Villota Miranda, J. (2024). Predicting Market Reactions to News: An LLM-Based Approach Using Spanish Business Articles. *Generative AI in Finance Conference* (John Molson School of Business, Montreal).

</div>

<div id="ref-wang2022self" class="csl-entry">

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). *Self-Consistency Improves Chain-of-Thought Reasoning in Language Models*. <https://arxiv.org/abs/2203.11171>

</div>

<div id="ref-Wei2022" class="csl-entry">

<span class="nocase">Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al.</span> (2022). Emergent abilities of large language models. *arXiv Preprint arXiv:2206.07682*.

</div>

<div id="ref-wei2022chain" class="csl-entry">

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022). *Chain-of-Thought Prompting Elicits Reasoning in Large Language Models*. <https://arxiv.org/abs/2201.11903>

</div>

<div id="ref-West1996" class="csl-entry">

West, K. D. (1996). Asymptotic inference about predictive ability. *Econometrica*, *64*(5), 1067–1084. <https://doi.org/10.2307/2171956>

</div>

<div id="ref-White2025" class="csl-entry">

White, N. (2025). *The New Keynesian Price Puzzle: Reinterpreting Inflation Dynamics* \[Working Paper\]. Amherst College.

</div>

<div id="ref-WuXia2016" class="csl-entry">

Wu, J. C., & Xia, F. D. (2016). Measuring the macroeconomic impact of monetary policy at the zero lower bound. *Journal of Money, Credit and Banking*, *48*(2–3), 253–291. <https://doi.org/10.1111/jmcb.12300>

</div>

<div id="ref-Wu2024" class="csl-entry">

Wu, Z., Bai, H., Zhang, A., Gu, J., Vydiswaran, V., Jaitly, N., & Zhang, Y. (2024). Divide-or-Conquer? Which Part Should You Distill Your LLM? *arXiv Preprint arXiv:2402.15000*.

</div>

<div id="ref-YangLiLamCheng2025" class="csl-entry">

Yang, S., Li, Y., Lam, W., & Cheng, Y. (2025). Multi-LLM collaborative search for complex problem solving. *arXiv Preprint arXiv:2502.18873*.

</div>

<div id="ref-Yu2024" class="csl-entry">

Yu, Y., Yao, Z., Li, H., Deng, Z., Jiang, Y., Cao, Y., Chen, Z., Suchow, J. W., Cui, Z., Liu, R., Xu, Z., Zhang, D., Subbalakshmi, K., Xiong, G., He, Y., Huang, J., Li, D., & Xie, Q. (2024). FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. *arXiv Preprint arXiv:2407.06567*. <https://arxiv.org/abs/2407.06567>

</div>

<div id="ref-ZhangKraskaKhattab2025" class="csl-entry">

Zhang, A. L., Kraska, T., & Khattab, O. (2025). Recursive Language Models. *arXiv Preprint arXiv:2512.24601*.

</div>

<div id="ref-Zhu2025" class="csl-entry">

Zhu, K., Du, H., Hong, Z., Yang, X., Guo, S., Wang, Z., Wang, Z., Qian, C., Tang, X., Ji, H., & You, J. (2025). MultiAgentBench: Evaluating the collaboration and competition of LLM agents. *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)*, 8580–8622. <https://aclanthology.org/2025.acl-long.421/>

</div>

</div>

[^1]: Department of Finance, Bocconi University, Milan, Italy. Email: <ruben.fernandez@phd.unibocconi.it>. Website: [rubenfernandezfuertes.com](https://rubenfernandezfuertes.com)

[^2]: I am deeply grateful to Max Croce, Carlo A. Favero, and Claudio Tebaldi for their
    invaluable supervision and guidance throughout my Ph.D. journey. I acknowledge the financial support of the Baffi Centre. I also thank
    Josefina Cenzon, Fiorella de Fiore, Martin Fankhauser, Nicola Gennaioli, Nicolás Guíñez, Alejandra Inzunza, Mohammad R. Jahan-Parvar, Tommaso Monacelli, David Murakami, Fernando Pérez-Cruz, Angelo Ranaldo, Damiano Sandri, Kevin Schneider, Ivan Shchapov, Jakob Ahm Sørensen, and Isabella M. Wolfskeil for their insightful comments and suggestions. Special thanks go to participants of the following conferences and workshops: ASSA/AFA Annual Meeting, AI and Society Conference (Bocconi University), Wolfe Research European Quantitative and Macro Investment Conference, Conference on Artificial Intelligence in the Macroeconomy (University at Albany/UCSB-LAEF), Workshop in Empirical Macroeconomics (University of Innsbruck), 1st Lausanne PhD Macroeconomics Conference (HEC Lausanne), and PhD Alumni Conference (Bocconi University); and to seminar participants at the Bank of England, Bank of Spain, Bank of Italy, CUNEF, Bilkent University, HEC Montréal, and the Bank for International Settlements. Parts of this paper were written whilst I was visiting the Bank for International Settlements, which I thank for its hospitality.

[^3]: The sample also includes unscheduled emergency decisions (e.g., January 2008, March 2020). For these meetings the standard document sequence is incomplete: no Beige Book precedes the decision and the Minutes from the prior meeting may not yet have been released. The pipeline handles missing stages by passing the prior through unchanged, so $`\mathcal{P}_k = \mathcal{P}_{k-1}`$ for any absent document.
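
The pass-through rule in this footnote can be sketched in a few lines (an illustrative sketch, not the paper's pipeline code; `update_posterior`, `run_stage`, and the toy dictionaries are hypothetical names):

```python
def update_posterior(prior, document, run_stage):
    """Advance the belief distribution P_{k-1} -> P_k by one document stage.

    When a stage's document is absent (e.g. no Beige Book before an
    emergency decision), the prior passes through unchanged.
    """
    if document is None:
        return prior  # P_k = P_{k-1} for any absent document
    return run_stage(prior, document)

# Toy stage that just records which documents have been read.
read = lambda prior, doc: {**prior, "seen": prior["seen"] + [doc]}

p1 = {"mean": 0.0, "seen": ["statement"]}
p2 = update_posterior(p1, None, read)        # missing press conference
p3 = update_posterior(p2, "minutes", read)   # minutes available

assert p2 == p1                              # pass-through: P_2 = P_1
assert p3["seen"] == ["statement", "minutes"]
```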

[^4]: Press conferences began in April 2011 and were held quarterly until January 2019, when they became available after every meeting. The pipeline skips this stage when no transcript exists, passing $`\mathcal{P}_1`$ through as $`\mathcal{P}_2`$.

[^5]: The hat on $`\hat{s}_t`$ reflects that the surprise is computed from the LLM’s approximate posterior $`\mathcal{P}_4`$ rather than from the true conditional expectation $`\mathbb{E}[\Delta i_t \mid \mathcal{B}_t]`$. When the context is unambiguous, I occasionally write $`s_t`$ for brevity.

[^6]: Scores are independent intensities, not a composition; in the v30.5/DeepSeek-v3.1 sample they range over $`[0, 0.7]`$ and $`[0, 0.9]`$ respectively. The extraction prompt and JSON schema are included in the replication package.

[^7]: The segmenter handles both the modern format (post-2009, explicit section headers) and the legacy format (pre-2009, identified through narrative transition patterns).

[^8]: The dependent variable uses the cumulative $`\mathcal{P}_1 \to \mathcal{P}_3`$ rather than $`\mathcal{P}_2 \to \mathcal{P}_3`$ because press conferences (the $`\mathcal{P}_1 \to \mathcal{P}_2`$ stage) became routine only in 2011 and were quarterly until 2019, so $`\mathcal{P}_2`$ is undefined for roughly half the sample; restricting to meetings with a press conference gives the same sign pattern on $`N = 81`$ meetings.

[^9]: When the FOMC sets a target range, I use the midpoint for calculations. Approximately 30 meetings (inter-meeting actions and emergency decisions) lack a dedicated Beige Book. To avoid conflating “no Beige Book” with “neutral economic conditions,” scores are demeaned within the dedicated-BB sample before zero-filling non-BB meetings; $`d_{\text{BB},t} = 0`$ then captures departure from average BB conditions rather than a neutral economy.
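
The demean-then-zero-fill construction can be illustrated as follows (simulated scores, not the paper's data; the variable names are hypothetical):

```python
import numpy as np

# Hypothetical Beige Book scores; NaN marks meetings without a dedicated
# Beige Book (inter-meeting actions and emergency decisions).
scores = np.array([0.4, 0.7, np.nan, 0.2, np.nan, 0.5])
has_bb = ~np.isnan(scores)

# Demean within the dedicated-BB sample first, then zero-fill: a value of
# zero now means "average BB conditions", not "neutral economy".
d_bb = np.where(has_bb, scores - scores[has_bb].mean(), 0.0)

# Demeaned BB meetings average to zero, so non-BB meetings coincide with
# the BB-sample average rather than with a literal score of zero.
assert abs(d_bb[has_bb].mean()) < 1e-12
```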

[^10]: The scope is restricted to conventional policy: the distribution excludes balance sheet operations, forward guidance about the future path, and unconventional tools.

[^11]: Under approximately Gaussian-Bayesian updating, positive effective precision at each stage ($`\tilde{\tau}_j > 0`$) implies monotone entropy decline from $`\mathcal{P}_1`$ to $`\mathcal{P}_4`$; the distributional analysis in Appendix <a href="#subsec:distributional_appendix" data-reference-type="ref" data-reference="subsec:distributional_appendix">10.1.2</a> confirms this in the data.

[^12]: Available at <https://www.acostamiguel.com/data.html>. Last accessed: October 2025.

[^13]: Available at <https://github.com/marekjarocinski/jkshocks_update_fed_202401>. Last accessed: October 2025. All high-frequency financial variables described below were collected by Jarociński.

[^14]: Monthly series, Feb 1988–Dec 2023 (298 meeting months in sample). Source: <https://www.michaeldbauer.com/files/monetary_policy_surprises_data.xlsx>.

[^15]: Monthly series, Jan 1991–Dec 2009 (159 meeting months), bounded by Greenbook confidentiality at the time of their analysis.

[^16]: Throughout the empirical sections I write $`\mathrm{E}[\Delta i_t \mid \mathcal{B}_t]`$ for the LLM’s reported posterior mean $`m_{4t}`$; this is the pipeline’s approximation to the true documentary conditional expectation rather than the conditional expectation itself, with the gap absorbed into the extraction-error term $`\eta_t`$ of Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a>.

[^17]: This test is closely related to the Mincer & Zarnowitz (1969) forecast efficiency test, which regresses $`\Delta i_t`$ on $`m_{4t}`$ and tests $`\alpha = 0`$, $`\beta = 1`$. Table <a href="#tab:expectation_moments" data-reference-type="ref" data-reference="tab:expectation_moments">[tab:expectation_moments]</a> reports $`\beta = 1.000`$ (s.e. = 0.096) in that specification, consistent with forecast rationality.

[^18]: The decomposition $`\Delta i_t \equiv m_{4t} + \hat{s}_t`$ holds by construction. What is not automatic is orthogonality: as shown in Section <a href="#sec:model" data-reference-type="ref" data-reference="sec:model">3</a>, $`\beta = 1`$ requires $`\mathrm{Cov}(m_{4t}, \hat{s}_t) = 0`$, which follows from the law of iterated expectations if the LLM aggregates as a Bayesian would and leakage is negligible, but need not hold for an arbitrary processor. Once orthogonality is established, $`R^2(m_{4t}) + R^2(\hat{s}_t) = 1`$ follows, and the empirical counterparts $`0.542 + 0.465 = 1.007`$ are close to that benchmark, with the small excess over unity reflecting finite-sample noise rather than violation of the identity.
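
The additivity of the two $`R^2`$ shares under orthogonality can be verified numerically (a simulation sketch with made-up variances, not the paper's estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
m = rng.normal(0.0, 1.0, n)   # expected component m_{4t}
s = rng.normal(0.0, 0.9, n)   # surprise, drawn independently of m
di = m + s                    # Delta i_t = m_{4t} + s_t by construction

def r2(y, x):
    """R^2 of a univariate OLS of y on x (with intercept)."""
    return np.corrcoef(y, x)[0, 1] ** 2

# With Cov(m, s) = 0, Var(di) = Var(m) + Var(s), so the two univariate
# R^2 values sum to one up to sampling noise.
total = r2(di, m) + r2(di, s)
assert abs(total - 1.0) < 0.01
```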

[^19]: Results are qualitatively similar when $`\mathcal{P}_2`$ is included on its post-2011 subsample. All three stages are estimated on the common subsample of $`N = 186`$ meetings where $`\mathcal{P}_1`$, $`\mathcal{P}_3`$, and $`\mathcal{P}_4`$ are jointly available, so coefficients are directly comparable.

[^20]: $`\sigma(\hat{s}_t) = 14.7`$ bp, so one-standard-deviation responses are roughly $`59\%`$ of the reported magnitudes. HAC standard errors treat $`\widehat{\text{FFR}}_t`$ as observed; Montiel Olea & Plagborg-Møller (2021) bound the resulting wedge under strong instruments, which the first-stage $`F`$-statistics (well above $`10`$) support.

[^21]: Sample sizes vary across exercises for data-availability reasons (span test, yield LP, baseline portfolio, sign-disagreement, paired-horizon difference each use different LLM/HF/GSS intersections); precise $`N`$ is reported in each table, and cross-sample qualitative agreement is the integrity check.

[^22]: The residual $`\hat{u}_t`$ mixes the document-vs-market wedge $`\xi_t^{\text{doc}}`$ (the gap between the market’s information set and what the documents reveal, defined formally in Appendix <a href="#subsec:span_derivation" data-reference-type="ref" data-reference="subsec:span_derivation">12.2</a>) with finite-basis approximation error and LLM extraction noise; it is not a clean estimate of the wedge.

[^23]: Equal-notional weighting preserves the carry component; the duration-neutral alternative is in Appendix Table <a href="#tab:dv01_neutral" data-reference-type="ref" data-reference="tab:dv01_neutral">[tab:dv01_neutral]</a> and lifts per-trade returns from $`+11.4`$ bp to $`+34.0`$ bp.

[^24]: To check that the projection is not silently absorbing OpenAI-family clustering into the consensus, I reconstruct $`\bar{s}_{-i,t}`$ from the four non-OpenAI families only; the resulting residual variance shares are reported alongside the baseline in Table <a href="#tab:cross_model_variance" data-reference-type="ref" data-reference="tab:cross_model_variance">2</a> and move by less than one percentage point in every case.

[^25]: Working out the projection in <a href="#eq:cmv-decomp" data-reference-type="eqref" data-reference="eq:cmv-decomp">[eq:cmv-decomp]</a> gives the exact identity
    $`\hat\beta_{\hat\varepsilon_i, r} = (1 - \beta_i) - \rho_i + \beta_i\,\rho_{-i} = (\rho_{-i} - \rho_i) + (1-\beta_i)(1 - \rho_{-i})`$,
    where $`\rho_i = \text{cov}(\text{prior}_i, r_t)/\text{var}(r_t)`$ measures how predictive family $`i`$’s prior is of the realized rate change. The first term is the differential prior accuracy of $`i`$ versus the consensus; the second is a scale-mismatch term that vanishes when $`\beta_i = 1`$ or when the consensus is a perfect linear predictor of $`r_t`$. In the data $`\beta_i \in [0.82, 1.18]`$ and $`\rho_{-i} \approx 0.43`$, so both terms can contribute. The substantive direction nevertheless follows from inverting the identity for each family. Memorization can only push $`\rho_i`$ upward (a memorizing prior is more accurate). A loading $`\hat\beta_{\hat\varepsilon_i, r}`$ that is positive therefore implies $`\rho_i < \rho_{-i} + (1-\beta_i)(1-\rho_{-i})/\beta_i`$, that is, family $`i`$’s prior is *less* accurate than a regime-adjusted version of the consensus, which rules out memorization for that family. A negative loading is consistent with either better reading or memorization.

[^26]: Beige Book scores are released eight times per year; I forward-fill each score until the next release to align with monthly macro data.
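
The alignment amounts to a forward-fill on a monthly grid; a minimal pandas sketch (the dates and scores are made up):

```python
import pandas as pd

# Hypothetical Beige Book release dates and scores.
bb = pd.Series(
    [0.3, 0.5, 0.2],
    index=pd.to_datetime(["2024-01-17", "2024-03-06", "2024-04-17"]),
)

# Monthly grid: each month carries the most recent score released in or
# before it, i.e. scores are forward-filled until the next release.
monthly = bb.resample("MS").last().ffill()
assert monthly.tolist() == [0.3, 0.3, 0.5, 0.2]  # February inherits January's score
```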

[^27]: The Federal Reserve Bank of New York votes at every meeting. The remaining eleven Reserve Bank presidents share four rotating slots: Chicago and Cleveland alternate one slot annually; three groups of three (Boston, Philadelphia, Richmond; Atlanta, St. Louis, Dallas; Minneapolis, Kansas City, San Francisco) each contribute one representative per year on a three-year cycle.

[^28]: Both regime series are from FRED, merged to meeting dates using the most recently available release before each meeting. The HP trend is estimated on the full sample, so $`D^\ell_t`$ is not a real-time indicator; it uses post-$`t`$ data to define what “above trend” means. Because the HP trend mostly tracks visible business-cycle phases, this approximation is unlikely to materially affect the results.

[^29]: Pairwise bootstrap tests ($`H_0`$: $`R^2_{\text{LLM}} = R^2_{\text{other}}`$, 1,000 block-bootstrap samples, block length 4) yield no significant difference for any measure (smallest $`p = 0.226`$, FF1). The 5–20% $`R^2`$ range is a common feature of monetary surprise measurement broadly, not specific to the narrative approach.
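
The pairwise test can be sketched as a block bootstrap of the $`R^2`$ difference with block length 4 (simulated data and hypothetical helper names; one common p-value construction, not necessarily the paper's exact implementation):

```python
import numpy as np

def block_resample(n, block_len, rng):
    """Index vector for one moving-block bootstrap draw of length n."""
    n_blocks = -(-n // block_len)  # ceiling division
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])
    return idx[:n]

def r2(y, x):
    return np.corrcoef(y, x)[0, 1] ** 2

rng = np.random.default_rng(1)
n = 200
y = rng.normal(size=n)               # stand-in outcome (e.g. a rate change)
x1 = 0.3 * y + rng.normal(size=n)    # two competing surprise measures,
x2 = 0.3 * y + rng.normal(size=n)    # equally informative by construction

# Bootstrap distribution of the R^2 difference under H0: R^2_1 = R^2_2.
diffs = np.array([
    r2(y[i], x1[i]) - r2(y[i], x2[i])
    for i in (block_resample(n, 4, rng) for _ in range(1000))
])
p = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
```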

[^30]: Identical 2SLP-IV settings throughout (4 lags for outcomes and controls, ZLB excluded, COVID included, Newey-West HAC bandwidth $`h+1`$).

[^31]: The pre-crisis sample yields weak first-stage $`F`$-statistics for all measures, making IV estimation unreliable. OLS with one-SD normalisation provides a transparent reduced-form comparison.

[^32]: Event counts in this table differ from the baseline $`N = 70`$ because the prompt version (v30.2) and the per-model surprise distributions differ from the v30.5 deepseek-v3.1 sample used elsewhere in the paper, which changes the top-tercile cutoff and the meetings selected.

[^33]: Each row plots the dovish-minus-hawkish gap in conditional-language frequency (per $`1{,}000`$ words) measured on the union of the prior FOMC statement, the prior FOMC minutes, and the most recent Beige Book, with bootstrap $`95\%`$ CIs over meetings. The Beige Book is excluded from the headline cell-counts; including it does not change the direction. The conditional lexicon is fixed across runs: *until*, *if/should*, *depending on*, *contingent*, *provided that*, *as long as*, *data-dependent*, threshold/trigger language, and *conditional on*.

[^34]: Apply $`\mathbb{E}[\cdot \mid \mathcal{D}_t]`$ to both sides of the definition of $`\xi_t^{\text{doc}}`$. Because $`\sigma(\mathcal{D}_t) \subseteq \mathcal{M}_{t^-}`$, the tower property gives $`\mathbb{E}[\,\mathbb{E}[\Delta i_t \mid \mathcal{M}_{t^-}] \mid \mathcal{D}_t] = \mathbb{E}[\Delta i_t \mid \mathcal{D}_t]`$, so the two terms cancel.
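
Written out, taking the appendix definition to be $`\xi_t^{\text{doc}} \equiv \mathbb{E}[\Delta i_t \mid \mathcal{M}_{t^-}] - \mathbb{E}[\Delta i_t \mid \mathcal{D}_t]`$ (the sign convention is assumed here; the cancellation is identical under the opposite sign):

```latex
\mathbb{E}\bigl[\xi_t^{\text{doc}} \mid \mathcal{D}_t\bigr]
  = \mathbb{E}\bigl[\,\mathbb{E}[\Delta i_t \mid \mathcal{M}_{t^-}] \mid \mathcal{D}_t\bigr]
    - \mathbb{E}[\Delta i_t \mid \mathcal{D}_t]
  = \mathbb{E}[\Delta i_t \mid \mathcal{D}_t]
    - \mathbb{E}[\Delta i_t \mid \mathcal{D}_t]
  = 0,
```

where the middle equality uses the tower property, valid because $`\sigma(\mathcal{D}_t) \subseteq \mathcal{M}_{t^-}`$.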
