References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, and et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. 2016.
Ossama Abdel-Hamid, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545, 2014.
Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander J Smola. Scalable inference in latent variable models. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, 123–132. ACM, 2012.
T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, and et al. Flamingo: a visual language model for few-shot learning. ArXiv:2204.14198, 2022.
Bilal Alsallakh, Narine Kokhlikyan, Vivek Miglani, Jun Yuan, and Orion Reblitz-Richardson. Mind the PAD – CNNs can develop blind spots. ArXiv:2010.02178, 2020.
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, and et al. PaLM 2 Technical Report. ArXiv:2305.10403, 2023.
Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Scalable second-order optimization for deep learning. ArXiv:2002.09018, 2020.
Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. ArXiv:1607.06450, 2016.
Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In International Conference on Learning Representations. 2018.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ArXiv:1409.0473, 2014.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, and et al. Constitutional AI: harmlessness from AI feedback. ArXiv:2212.08073, 2022.
R. Baptista and M. Poloczek. Bayesian optimization of combinatorial structures. In Proceedings of the 35th International Conference on Machine Learning. 2018.
R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). 2013.
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: speeded up robust features. In European Conference on Computer Vision, 404–417. Springer, 2006.
R. Bellman. Dynamic programming. Science, 153:34–37, 1966.
Richard Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38(8):716–719, 1952.
Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957. URL: http://www.jstor.org/stable/24900506 (visited on 2022-11-28).
Richard Bellman. Dynamic Programming. Dover Publications, 1957. ISBN 9780486317199.
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: the long-document transformer. ArXiv:2004.05150, 2020.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 2011.
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conference, volume 1, 3–10. 2010.
Alex Beutel, Kenton Murray, Christos Faloutsos, and Alexander J Smola. CoBaFi: collaborative Bayesian filtering. In Proceedings of the 23rd International Conference on World Wide Web, 97–108. 2014.
Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81:637–654, 1973.
Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-NMS – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, 5561–5569. 2017.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
B Bollobás. Linear Analysis. Cambridge University Press, 1999.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, and et al. On the opportunities and risks of foundation models. ArXiv:2108.07258, 2021.
Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, 177–186. Springer, 2010.
Léon Bottou and Yann Le Cun. SN: a simulator for connectionist models. In Proceedings of Neuro-Nîmes 88, 371–382. Nîmes, France, 1988. URL: http://leon.bottou.org/papers/bottou-lecun-88.
Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. ArXiv:1508.05326, 2015.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England, 2004.
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
Noam Brown and Tuomas Sandholm. Libratus: the superhuman AI for no-limit poker. In IJCAI, 5226–5228. 2017.
Peter F Brown, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Frederick Jelinek, John Lafferty, Robert L Mercer, and Paul S Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, 1990.
Peter F Brown, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Frederick Jelinek, Robert L Mercer, and Paul Roossin. A statistical approach to language translation. In COLING Budapest 1988 Volume 1: International Conference on Computational Linguistics. 1988.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Alexander Buslaev, Vladimir I Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A Kalinin. Albumentations: fast and flexible image augmentations. Information, 11(2):125, 2020.
Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.
John Canny. A computational approach to edge detection. In Readings in Computer Vision, pages 184–203. Elsevier, 1987.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 Task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 1–14. 2017.
William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell. ArXiv:1508.01211, 2015.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. ArXiv:1512.01274, 2015.
Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 551–561. 2016.
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: efficient primitives for deep learning. ArXiv:1410.0759, 2014.
Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: encoder–decoder approaches. ArXiv:1409.1259, 2014.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. ArXiv:1406.1078, 2014.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and et al. PaLM: scaling language modeling with pathways. ArXiv:2204.02311, 2022.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv:1412.3555, 2014.
Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. ELECTRA: pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations. 2020.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. In International Conference on Learning Representations. 2020.
Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 1999.
Imre Csiszár. Axiomatic characterizations of information measures. Entropy, 10(3):261–273, 2008.
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, 886–893. IEEE, 2005.
Dean De Cock. Ames, Iowa: alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3), 2011.
Jeffrey Dean, Greg S Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V Le, Mark Z Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, and et al. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1, 1223–1231. 2012.
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, 205–220. ACM, 2007.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE, 2009.
Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. ArXiv:1810.04805, 2018.
Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, 1422–1430. 2015.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations. 2021.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. ArXiv:1603.07285, 2016.
Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. ArXiv:2012.09699, 2020.
Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, 117–126. 2015.
Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: a survey. ArXiv:1808.05377 [stat.ML], 2018.
Gustav Theodor Fechner. Elemente der Psychophysik. Volume 2. Breitkopf u. Härtel, 1860.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
Randima Fernando. GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. Addison-Wesley, 2004.
M. Feurer and F. Hutter. Hyperparameter optimization. In Automated Machine Learning: Methods, Systems, Challenges. Springer, 2018.
M. Feurer, B. Letham, F. Hutter, and E. Bakshy. Practical transfer learning for Bayesian optimization. ArXiv:1802.02219 [stat.ML], 2022.
David J Field. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A, 4(12):2379–2394, 1987.
R A Fisher. Statistical Methods for Research Workers. Oliver & Boyd, 1925.
Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. In Conference on Learning Theory, 658–695. 2015.
Alexander IJ Forrester, András Sóbester, and Andy J Keane. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088):3251–3269, 2007.
L. Franceschi, M. Donini, P. Frasconi, and M. Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML'17). 2017.
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: finding sparse, trainable neural networks. ArXiv:1803.03635, 2018.
Peter I Frazier. A tutorial on Bayesian optimization. ArXiv:1807.02811, 2018.
Yoav Freund and Robert E Schapire. Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning, volume 96, 148–156. Citeseer, 1996.
Jerome H Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987.
Roy Frostig, Matthew James Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. In Proceedings of Systems for Machine Learning. 2018.
Kunihiko Fukushima. Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer, 1982.
Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: blackbox matrix–matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, volume 31. 2018.
Saurabh Garg, Sivaraman Balakrishnan, Zico Kolter, and Zachary Lipton. RATT: leveraging unlabeled data to guarantee generalization. In International Conference on Machine Learning, 3598–3609. PMLR, 2021.
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2414–2423. 2016.
Carl Friedrich Gauss. Theoria motus corporum coelestium. In Werke. Königlich Preussische Akademie der Wissenschaften, 1809.
Josiah Willard Gibbs. Elementary Principles of Statistical Mechanics. Scribner's, 1902.
Jean Ginibre. Statistical ensembles of complex, quaternion, and real matrices. Journal of Mathematical Physics, 6(3):440–449, 1965.
Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 1440–1448. 2015.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587. 2014.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 249–256. 2010.
Gabriel Goh. Why momentum really works. Distill, 2017. URL: http://distill.pub/2017/momentum.
David Goldberg, David Nichols, Brian M Oki, and Douglas Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61–71, 1992.
Gene H Golub and Charles F Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680. 2014.
Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. ArXiv:1810.13243, 2018.
Ankit Goyal, Alexey Bochkovskiy, Jia Deng, and Vladlen Koltun. Non-deep networks. ArXiv:2110.07641, 2021.
Benjamin Graham. Fractional max-pooling. ArXiv:1412.6071, 2014.
Alex Graves. Generating sequences with recurrent neural networks. ArXiv:1308.0850, 2013.
Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2008.
Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
Andreas Griewank. On automatic differentiation. In Mathematical Programming: Recent Developments and Applications, pages 83–107. Kluwer, 1989.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and et al. Conformer: convolution-augmented transformer for speech recognition. Proc. Interspeech 2020, 5036–5040, 2020.
Asela Gunawardana and Guy Shani. Evaluating recommender systems. In Recommender Systems Handbook, pages 265–308. Springer, 2015.
Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 1725–1731. AAAI Press, 2017.
Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lofti A Zadeh. Feature Extraction: Foundations and Applications. Springer, 2008.
Stefan Hadjis, Ce Zhang, Ioannis Mitliagkas, Dan Iter, and Christopher Ré. Omnivore: an optimizer for multi-device deep learning on CPUs and GPUs. ArXiv:1606.04487, 2016.
Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
Richard I Hartley and Fredrik Kahl. Global optimization through rotation space search. International Journal of Computer Vision, 82(1):64–79, 2009.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009. 2022.
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2961–2969. 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026–1034. 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778. 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, 630–645. Springer, 2016.
Xiangnan He and Tat-Seng Chua. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 355–364. ACM, 2017.
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee, 2017.
Donald Olding Hebb. The Organization of Behavior. Wiley, 1949.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). ArXiv:1606.08415, 2016.
John L Hennessy and David A Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 2011.
Jonathan L Herlocker, Joseph A Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In 22nd Annual International ACM Conference on Research and Development in Information Retrieval, SIGIR 1999, 230–237. Association for Computing Machinery, Inc, 1999.
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. ArXiv:1511.06939, 2015.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, and et al. Training compute-optimal large language models. ArXiv:2203.15556, 2022.
Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1314–1324. 2019.
Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, 689–696. 2009.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141. 2018.
Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 2008 8th IEEE International Conference on Data Mining, 263–272. IEEE, 2008.
Zhiqiang Hu, Roy Ka-Wei Lee, Charu C. Aggarwal, and Aston Zhang. Text style transfer: a review and experimental evaluation. SIGKDD Explorations Newsletter, 2022. URL: https://doi.org/10.1145/3544903.3544906.
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer: generating music with long-term structure. In International Conference on Learning Representations. 2018.
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708. 2017.
Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM–CRF models for sequence tagging. ArXiv:1508.01991, 2015.
David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148(3):574–591, 1959.
David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160(1):106–154, 1962.
David H Hubel and Torsten N Wiesel. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195(1):215–243, 1968.
F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11). 2011.
F. Hutter, L. Kotthoff, and J. Vanschoren, editors. Automated Machine Learning: Methods, Systems, Challenges. Springer, 2019.
Sergey Ioffe. Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, 1945–1953. 2017.
Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv:1502.03167, 2015.
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. ArXiv:1803.05407, 2018.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31. 2018.
Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the “echo state network” approach. GMD-Forschungszentrum Informationstechnik Bonn, 2002.
K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics. 2016.
R. Jenatton, C. Archambeau, J. González, and M. Seeger. Bayesian optimization with tree-structured dependencies. In Proceedings of the 34th International Conference on Machine Learning (ICML'17). 2017.
Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, and et al. Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. ArXiv:1807.11205, 2018.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, 675–678. 2014.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2020.
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, and et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 1–12. IEEE, 2017.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. ArXiv:1404.2188, 2014.
Barry L Kalman and Stan C Kwasny. Why tanh: choosing a sigmoidal function. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 578–581. IEEE, 1992.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ArXiv:2001.08361, 2020.
Z. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML'13). 2013.
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. ArXiv:1710.10196, 2017.
Jaeyoung Kim, Mostafa El-Khamy, and Jungwon Lee. Residual LSTM: design of a deep recurrent architecture for distant speech recognition. ArXiv:1701.03360, 2017.
Yoon Kim. Convolutional neural networks for sentence classification. ArXiv:1408.5882, 2014.
G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82–95, 1971.
Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. ArXiv:1412.6980, 2014.
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ArXiv:1609.02907, 2016.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. ArXiv:2205.11916, 2022.
Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
Andrey Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4:83–91, 1933.
Zico Kolter. Linear algebra review and reference. Available online: http://cs229.stanford.edu/section/cs229-linalg.pdf, 2008.
Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105. 2012.
Sun Yuan Kung. VLSI Array Processors. Prentice Hall, 1988.
Ilya Kuzovkin, Raul Vicente, Mathilde Petton, Jean-Philippe Lachaux, Monica Baciu, Philippe Kahane, Sylvain Rheims, Juan R Vidal, and Jaan Aru. Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications Biology, 1(1):1–12, 2018.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: a lite BERT for self-supervised learning of language representations. ArXiv:1909.11942, 2019.
Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4013–4021. 2016.
Quoc V Le. Building high-level features using large scale unsupervised learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 8595–8598. IEEE, 2013.
Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
Yann LeCun, Leon Bottou, G Orr, and Klaus-Robert Muller. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Yann LeCun, LD Jackel, Leon Bottou, A Brunot, Corinna Cortes, John Denker, Harris Drucker, Isabelle Guyon, UA Muller, Eduard Sackinger, and et al. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, 53–60. 1995.
Adrien Marie Legendre. Mémoire sur les Opérations Trigonométriques: dont les Résultats Dépendent de la Figure de la Terre. F. Didot, 1805.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv:1910.13461, 2019.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, and et al. Solving quantitative reasoning problems with language models. ArXiv:2206.14858, 2022.
L. Li, K. Jamieson, A. Rostamizadeh, K. Gonina, M. Hardt, B. Recht, and A. Talwalkar. Massively parallel hyperparameter tuning. ArXiv:1810.05934, 2018.
Mu Li. Scaling Distributed Machine Learning with System and Algorithm Co-design. PhD thesis, Carnegie Mellon University, 2017.
Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th Symposium on Operating Systems Design and Implementation (OSDI 14), 583–598. 2014.
Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 661–670. 2014.
R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. Gonzalez, and I. Stoica. Tune: a research platform for distributed model selection and training. ArXiv:1807.05118, 2018.
Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. ArXiv:1312.4400, 2013.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988. 2017.
Yuanqing Lin, F Lv, S Zhu, M Yang, T Cour, K Yu, L Cao, Z Li, MH Tsai, X Zhou, and et al. ImageNet classification: fast descriptor coding and large-scale SVM training. Large Scale Visual Recognition Challenge, 2010.
Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. ArXiv:1703.03130, 2017.
Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. ArXiv:1506.00019, 2015.
Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with LSTM recurrent neural networks. In International Conference on Learning Representations (ICLR). 2016.
Zachary C Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. Queue, 17(1):45–77, 2019.
Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528, 1989.
Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search. ArXiv:1806.09055, 2018.
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: single shot multibox detector. In European Conference on Computer Vision, 21–37. Springer, 2016.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: a robustly optimized BERT pretraining approach. ArXiv:1907.11692, 2019.
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022. 2021.
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convNet for the 2020s. ArXiv:2201.03545, 2022.
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440. 2015.
Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. ArXiv:1608.03983, 2016.
David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
Ping Luo, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. Towards understanding regularization in batch normalization. ArXiv:1809.00846, 2018.
Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 142–150. Association for Computational Linguistics, 2011.
Yue-Pok Mack and Bernard W Silverman. Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61(3):405–415, 1982.
David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15). 2015.
O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444–452, 1965.
Myles E Mangram. A simplified perspective of the Markowitz portfolio theory. Global Journal of Business Research, 7(1):59–70, 2013.
Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. ArXiv:1804.11271, 2018.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, 6294–6305. 2017.
Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, and et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1222–1230. ACM, 2013.
Carver Mead. Introduction to VLSI systems. IEE Proceedings I-Solid-State and Electron Devices, 128(1):18, 1980.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. ArXiv:1609.07843, 2016.
Charles A Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. In Approximation Theory and Spline Functions, pages 143–145. Springer, 1984.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ArXiv:1301.3781, 2013.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119. 2013.
George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, 2430–2439. 2017.
Volodymyr Mnih, Nicolas Heess, Alex Graves, and et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, 2204–2212. 2014.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. ArXiv:1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, and et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Taesup Moon, Alex Smola, Yi Chang, and Zhaohui Zheng. IntervalRank: isotonic regression with listwise and pairwise constraints. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, 151–160. 2010.
Richard D Morey, Rink Hoekstra, Jeffrey N Rouder, Michael D Lee, and Eric-Jan Wagenmakers. The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1):103–123, 2016.
Vladimir Alekseevich Morozov. Methods for Solving Incorrectly Posed Problems. Springer, 1984.
Elizbar A Nadaraya. On estimating regression. Theory of Probability & its Applications, 9(1):141–142, 1964.
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML'10). 2010.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.
Moni Naor and Omer Reingold. On the construction of pseudorandom permutations: Luby–Rackoff revisited. Journal of Cryptology, 12(1):29–66, 1999.
Radford M Neal. Bayesian Learning for Neural Networks. Springer, 1996.
Yu Nesterov. Lectures on Convex Optimization. Springer, 2018.
Yu Nesterov and J-Ph Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559–1568, 2000.
Jerzy Neyman. Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767):333–380, 1937.
Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, and Francesco Locatello. ASIF: coupled data turns unimodal models to multimodal without training. ArXiv:2210.01738, 2022.
Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. ArXiv:1810.05148, 2018.
A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, 615–622. Polytechnic Institute of Brooklyn, 1962.
Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
Cheng Soon Ong, Alexander Smola, and Robert Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005.
OpenAI. GPT-4 Technical Report. ArXiv:2303.08774, 2023.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and et al. Training language models to follow instructions with human feedback. ArXiv:2203.02155, 2022.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. 2002.
Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. ArXiv:1606.01933, 2016.
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2337–2346. 2019.
Emanuel Parzen. On consistent estimates of the spectrum of a stationary time series. Annals of Mathematical Statistics, 28:329–348, 1957.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and et al. PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. ArXiv:1705.04304, 2017.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. ArXiv:2306.01116, 2023.
Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, 4785–4795. 2017.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. 2014.
Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.
Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 1, 1756–1765. 2017.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 2227–2237. 2018.
Kaare Brandt Petersen and Michael Syskind Pedersen. The Matrix Cookbook. Technical University of Denmark, 2008.
Geoff Pleiss, Danlu Chen, Gao Huang, Tongcheng Li, Laurens Van Der Maaten, and Kilian Q Weinberger. Memory-efficient implementation of densenets. ArXiv:1707.06990, 2017.
Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
Aaditya Prakash, Sadid A Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. Neural paraphrase generation with stacked residual LSTM networks. ArXiv:1610.03098, 2016.
Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. Is ChatGPT a general-purpose natural language processing task solver? ArXiv:2302.06476, 2023.
Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. Sequence-aware recommender systems. ACM Computing Surveys, 51(4):66, 2018.
J Ross Quinlan. C4.5: Programs for Machine Learning. Elsevier, 1993.
Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice-Hall, 1993.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR, 2021.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ArXiv:1511.06434, 2015.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár. On network design spaces for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1882–1890. 2019.
Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10428–10436. 2020.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, and et al. Scaling language models: methods, analysis & insights from training gopher. ArXiv:2112.11446, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. ArXiv:1606.05250, 2016.
Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 2019.
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. ArXiv:1710.05941, 2017.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ArXiv:2204.06125, 2022.
Santiago Ramón y Cajal and L. Azoulay. Les Nouvelles Idées sur la Structure du Système Nerveux chez l'Homme et chez les Vertébrés. C. Reinwald & Cie, Paris, 1894.
Marc-Aurelio Ranzato, Y-Lan Boureau, Sumit Chopra, and Yann LeCun. A unified energy-based framework for unsupervised learning. In Artificial Intelligence and Statistics, 371–379. PMLR, 2007.
Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. ArXiv:1904.09237, 2019.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788. 2016.
Joseph Redmon and Ali Farhadi. YOLOv3: an incremental improvement. ArXiv:1804.02767, 2018.
Scott Reed and Nando De Freitas. Neural programmer-interpreters. ArXiv:1511.06279, 2015.
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, and et al. A generalist agent. ArXiv:2205.06175, 2022.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99. 2015.
Steffen Rendle. Factorization machines. In 2010 IEEE International Conference on Data Mining, 995–1000. IEEE, 2010.
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 452–461. AUAI Press, 2009.
Jarrett Revels, Miles Lubin, and Theodore Papamarkou. Forward-mode automatic differentiation in Julia. ArXiv:1607.07892, 2016.
Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. ArXiv:1705.10694, 2017.
W. Rudin. Functional Analysis. McGraw-Hill, 1973.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C. Berg, and Li Fei-Fei. Detecting avocados to zucchinis: what have we done, and where are we going? In International Conference on Computer Vision (ICCV). 2013.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Stuart J Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, and et al. Photorealistic text-to-image diffusion models with deep language understanding. ArXiv:2205.11487, 2022.
D. Salinas, M. Seeger, A. Klein, V. Perrone, M. Wistuba, and C. Archambeau. Syne Tune: a library for large scale hyperparameter tuning and reproducible research. In First Conference on Automated Machine Learning. 2022.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv:1910.01108, 2019.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, and et al. Multitask prompted training enables zero-shot task generalization. ArXiv:2110.08207, 2021.
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, 2483–2493. 2018.
Badrul Munir Sarwar, George Karypis, Joseph A Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of 10th International Conference on World Wide Web, 285–295. 2001.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, and et al. BLOOM: a 176B-parameter open-access multilingual language model. ArXiv:2211.05100, 2022.
Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David M Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 253–260. ACM, 2002.
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, and et al. LAION-5B: an open large-scale dataset for training next generation image-text models. ArXiv:2210.08402, 2022.
Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In D. P. Helmbold and B. Williamson, editors, Proceedings of the Annual Conference on Computational Learning Theory, 416–426. Springer-Verlag, 2001.
Bernhard Schölkopf, Chris Burges, and Vladimir Vapnik. Incorporating invariances in support vector learning machines. In International Conference on Artificial Neural Networks, 47–52. Springer, 1996.
Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. AutoRec: autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, 111–112. ACM, 2015.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. ArXiv:1508.07909, 2015.
Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. ArXiv:1802.05799, 2018.
Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
Huajie Shao, Shuochao Yao, Dachun Sun, Aston Zhang, Shengzhong Liu, Dongxin Liu, Jun Wang, and Tarek Abdelzaher. ControlVAE: controllable variational autoencoder. In Proceedings of the 37th International Conference on Machine Learning. JMLR. org, 2020.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. ArXiv:1803.02155, 2018.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: training multi-billion parameter language models using model parallelism. ArXiv:1909.08053, 2019.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, and et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
Patrice Y Simard, Yann A LeCun, John S Denker, and Bernard Victorri. Transformation invariance in pattern recognition – tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, 239–274. Springer, 1998.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ArXiv:1409.1556, 2014.
Vikas Sindhwani, Tara N Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. ArXiv:1510.01722, 2015.
Josef Sivic and Andrew Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision, 1470–1477. IEEE Computer Society, 2003.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, and et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. ArXiv:2201.11990, 2022.
Alexander Smola and Shravan Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2):703–710, 2010.
J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, 2951–2959. 2012.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2256–2265. PMLR, 2015.
Bert Speelpenning. Compiling fast partial derivatives of functions given by algorithms. PhD thesis, University of Illinois at Urbana-Champaign, 1980.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, and et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. ArXiv:2206.04615, 2022.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. ArXiv:1505.00387, 2015.
Gilbert Strang. Introduction to Linear Algebra. Wellesley–Cambridge Press, 1993.
Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009.
Sainbayar Sukhbaatar, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems, 2440–2448. 2015.
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 1139–1147. 2013.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112. 2014.
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In 31st AAAI Conference on Artificial Intelligence. 2017.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9. 2015.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826. 2016.
Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. ArXiv:1705.08209, 2017.
Mingxing Tan and Quoc Le. EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 6105–6114. PMLR, 2019.
Jiaxi Tang and Ke Wang. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 565–573. ACM, 2018.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. Advances in Neural Information Processing Systems, 16:25, 2004.
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: a survey. ArXiv:2009.06732, 2020.
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: a large language model for science. ArXiv:2211.09085, 2022.
Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. ArXiv:1802.06455, 2018.
Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: the new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
Tijmen Tieleman and Geoffrey Hinton. Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning, Lecture 6.5-rmsprop. 2012.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. W.H. Winston, 1977.
Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, and et al. MLP-Mixer: an all-MLP architecture for vision. Advances in Neural Information Processing Systems, 2021.
Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357. PMLR, 2021.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and et al. LLaMA: open and efficient foundation language models. ArXiv:2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and et al. LLaMA 2: open foundation and fine-tuned chat models. ArXiv:2307.09288, 2023b.
Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: an overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
Alan Turing. Computing machinery and intelligence. Mind, 59(236):433–460, 1950.
Andreas Töscher, Michael Jahrer, and Robert M Bell. The BigChaos solution to the Netflix Grand Prize. Netflix Prize documentation, 2009.
Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 1964.
V. Vapnik and A. Chervonenkis. Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181:915–918, 1968.
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16(2):264–281, 1971.
V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3):543–564, 1981.
V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):283–305, 1991.
V. N. Vapnik and A. Y. Chervonenkis. Ordered risk minimization. Automation and Remote Control, 35:1226–1235, 1403–1412, 1974.
Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, 831–838. 1992.
Vladimir Vapnik, Esther Levin, and Yann Le Cun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851–876, 1994.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. 2017.
Grace Wahba. Spline Models for Observational Data. SIAM, 1990.
Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328–339, 1989.
Haotao Wang, Aston Zhang, Shuai Zheng, Xingjian Shi, Mu Li, and Zhangyang Wang. Removing batch normalization boosts adversarial training. In International Conference on Machine Learning, 23433–23445. PMLR, 2022.
Leyuan Wang, Mu Li, Edo Liberty, and Alex J Smola. Optimal message scheduling for aggregation. In Proceedings of Machine Learning and Systems (SysML). 2018.
Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1810–1822. 2019.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations. 2023.
Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D Owens. Gunrock: a high-performance graph processing library on the GPU. In ACM SIGPLAN Notices, volume 51, article 11. ACM, 2016.
Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019.
Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2013.
Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3–4):279–292, 1992.
Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 359–372, 1964.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. ArXiv:2109.01652, 2021.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, and et al. Emergent abilities of large language models. ArXiv:2206.07682, 2022.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. ArXiv:2201.11903, 2022.
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688. 2011.
Robert Edwin Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, 1964.
Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
Eugene P. Wigner. On the distribution of the roots of certain symmetric matrices. Annals of Mathematics, 67(2):325–327, 1958.
Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 33:4697–4708, 2020.
M. Wistuba, A. Rawat, and T. Pedapati. A survey on neural architecture search. ArXiv:1905.01392, 2019.
M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Scalable Gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning, 107:43–78, 2018.
David H Wolpert and William G Macready. No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute, 1995.
Frank Wood, Jan Gasthaus, Cédric Archambeau, Lancelot James, and Yee Whye Teh. The sequence memoizer. Communications of the ACM, 54(2):91–98, 2011.
Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: a zero flop, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9127–9135. 2018.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and et al. Google's neural machine translation system: bridging the gap between human and machine translation. ArXiv:1609.08144, 2016.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv:1708.07747, 2017.
Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, 5393–5402. 2018.
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500. 2017.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, 10524–10533. PMLR, 2020.
Wayne Xiong, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and Andreas Stolcke. The Microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5934–5938. IEEE, 2018.
Kouichi Yamaguchi, Kenji Sakamoto, Toshio Akabane, and Yoshiji Fujimoto. A neural network for speaker-independent isolated word recognition. In First International Conference on Spoken Language Processing. 1990.
Zichao Yang, Zhiting Hu, Yuntian Deng, Chris Dyer, and Alex Smola. Neural machine translation with recurrent attention modeling. ArXiv:1607.05108, 2016.
Zichao Yang, Marcin Moczulski, Misha Denil, Nando De Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, 1476–1483. 2015.
Mao Ye, Peifeng Yin, Wang-Chien Lee, and Dik-Lun Lee. Exploiting geographical influence for collaborative point-of-interest recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 325–334. ACM, 2011.
Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. ArXiv:1708.03888, 2017.
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. ArXiv:2206.10789, 2022.
Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, 9793–9803. 2018.
Matthew D Zeiler. ADADELTA: an adaptive learning rate method. ArXiv:1212.5701, 2012.
Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural networks. ArXiv:1301.3557, 2013.
Aston Zhang, Yi Tay, Shuai Zhang, Alvin Chan, Anh Tuan Luu, Siu Cheung Hui, and Jie Fu. Beyond fully-connected layers with quaternions: parameterization of hypercomplex multiplications with 1/n parameters. In International Conference on Learning Representations. 2021.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys, 52(1):5, 2019.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, and et al. OPT: open pre-trained transformer language models. ArXiv:2205.01068, 2022.
Wei Zhang, Jun Tanida, Kazuyoshi Itoh, and Yoshiki Ichioka. Shift-invariant pattern recognition neural network and its optical architecture. In Proceedings of the Annual Conference of the Japan Society of Applied Physics. 1988.
Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. ByteTrack: multi-object tracking by associating every detection box. ArXiv:2110.06864, 2021.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. In International Conference on Learning Representations. 2023.
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. ArXiv:2302.00923, 2023.
Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems, 30(11):3212–3232, 2019.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. In International Conference on Learning Representations. 2023.
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2223–2232. 2017.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, 19–27. 2015.
Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. ArXiv:1611.01578, 2016.