References
A. M. Legendre. Mémoire sur les Opérations Trigonométriques, dont les Résultats Dépendent de la Figure de la Terre (F. Didot, 1805).
C. F. Gauss. Theoria motus corporum coelestium. In: Werke (Königlich Preussische Akademie der Wissenschaften, 1809).
G. H. Golub and C. F. Van Loan. Matrix Computations (Johns Hopkins University Press, 1996).
D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 503–528 (1989).
L. Bottou. Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010 (Springer, 2010); pp. 177–186.
M. Li, T. Zhang, Y. Chen and A. J. Smola. Efficient mini-batch training for stochastic optimization. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014); pp. 661–670.
P. I. Frazier. A tutorial on Bayesian optimization. ArXiv:1807.02811 (2018).
P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov and A. G. Wilson. Averaging weights leads to wider optima and better generalization. ArXiv:1803.05407 (2018).
J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ArXiv:1803.03635 (2018).
S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (Pearson Education Limited, 2016).
F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy 81, 637–654 (1973).
M. Naor and O. Reingold. On the construction of pseudorandom permutations: Luby–Rackoff revisited. Journal of Cryptology 12, 29–66 (1999).
V. Vapnik. Principles of risk minimization for learning theory. In: Advances in Neural Information Processing Systems (1992); pp. 831–838.
J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley and Y. Bengio. Theano: A CPU and GPU math compiler in Python. In: Proc. 9th Python in Science Conference, Vol. 1 (2010); pp. 3–10.
J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker et al. Large scale distributed deep networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, Vol. 1 (2012); pp. 1223–1231.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia (2014); pp. 675–678.
L. Bottou and Y. Le Cun. SN: A simulator for connectionist models. In: Proceedings of Neuro-Nîmes 88 (Nîmes, France, 1988); pp. 371–382.
T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. ArXiv:1512.01274 (2015).
R. Frostig, M. J. Johnson and C. Leary. Compiling machine learning programs via high-level tracing. In: Proceedings of Systems for Machine Learning (Google Research, 2018).
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, 8026–8037 (2019).
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al. TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016); pp. 265–283.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2009); pp. 248–255.
B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM 59, 64–73 (2016).
V. Vapnik. Statistical Learning Theory (John Wiley and Sons, New York, 1998).
S. Boucheron, O. Bousquet and G. Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics 9, 323–375 (2005).
V. Vapnik, E. Levin and Y. Le Cun. Measuring the VC-dimension of a learning machine. Neural Computation 6, 851–876 (1994).
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2002).
C. S. Ong, A. Smola and R. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research 6, 1043–1071 (2005).
G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining 3, 1–13 (2007).
Z. Huang, W. Xu and K. Yu. Bidirectional LSTM–CRF models for sequence tagging. ArXiv:1508.01991 (2015).
T. Moon, A. Smola, Y. Chang and Z. Zheng. IntervalRank: isotonic regression with listwise and pairwise constraints. In: Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (2010); pp. 151–160.
A. Beutel, K. Murray, C. Faloutsos and A. J. Smola. CoBaFi: collaborative Bayesian filtering. In: Proceedings of the 23rd International Conference on World Wide Web (2014); pp. 97–108.
C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal 27, 379–423 (1948).
Z. Yang, M. Moczulski, M. Denil, N. De Freitas, A. Smola, L. Song and Z. Wang. Deep fried convnets. In: Proceedings of the IEEE International Conference on Computer Vision (2015); pp. 1476–1483.
V. Sindhwani, T. N. Sainath and S. Kumar. Structured transforms for small-footprint deep learning. ArXiv:1510.01722 (2015).
A. Zhang, Y. Tay, S. Zhang, A. Chan, A. T. Luu, S. C. Hui and J. Fu. Beyond fully-connected layers with quaternions: Parameterization of hypercomplex multiplications with 1/n parameters. In: International Conference on Learning Representations (2021).
R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 324–345 (1952).
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger et al. Comparison of learning algorithms for handwritten digit recognition. In: International Conference on Artificial Neural Networks (1995); pp. 53–60.
B. Schölkopf, C. Burges and V. Vapnik. Incorporating invariances in support vector learning machines. In: International Conference on Artificial Neural Networks (Springer, 1996); pp. 47–52.
P. Y. Simard, Y. A. LeCun, J. S. Denker and B. Victorri. Transformation invariance in pattern recognition – tangent distance and tangent propagation. In: Neural Networks: Tricks of the Trade (Springer, 1998); pp. 239–274.
H. Xiao, K. Rasul and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv:1708.07747 (2017).
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold and A. L. Roth. Preserving statistical validity in adaptive data analysis. In: Proceedings of the 47th Annual ACM Symposium on Theory of Computing (2015); pp. 117–126.
V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control 25 (1964).
V. Vapnik and A. Chervonenkis. Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR 181, 915–918 (1968).
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264–281 (1971).
V. Vapnik and A. Chervonenkis. Ordered risk minimization. Automation and Remote Control 35, 1226–1235, 1403–1412 (1974).
V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya 26, 543–564 (1981).
V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis 1, 283–305 (1991).
H. Shao, S. Yao, D. Sun, A. Zhang, S. Liu, D. Liu, J. Wang and T. Abdelzaher. ControlVAE: Controllable variational autoencoder. In: Proceedings of the 37th International Conference on Machine Learning (PMLR, 2020).
R. A. Fisher. Statistical Methods for Research Workers (Oliver & Boyd, 1925).
J. R. Quinlan. C4.5: Programs for Machine Learning (Elsevier, 1993).
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society 68, 337–404 (1950).
G. Wahba. Spline Models for Observational Data (SIAM, 1990).
S. Ramón y Cajal and L. Azoulay. Les Nouvelles Idées sur la Structure du Système Nerveux chez l'Homme et chez les Vertébrés (C. Reinwald & Cie, Paris, 1894).
W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115–133 (1943).
Y. LeCun, L. Bottou, G. Orr and K.-R. Muller. Efficient backprop. In: Neural Networks: Tricks of the Trade (Springer, 1998).
B. L. Kalman and S. C. Kwasny. Why tanh: choosing a sigmoidal function. In: Proceedings of the International Joint Conference on Neural Networks (IJCNN) (IEEE, 1992); pp. 578–581.
D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). ArXiv:1606.08415 (2016).
P. Ramachandran, B. Zoph and Q. V. Le. Searching for activation functions. ArXiv:1710.05941 (2017).
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ArXiv:1502.03167 (2015).
L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. Schoenholz and J. Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In: International Conference on Machine Learning (2018); pp. 5393–5402.
Y. You, I. Gitman and B. Ginsburg. Large batch training of convolutional networks. ArXiv:1708.03888 (2017).
D. H. Wolpert and W. G. Macready. No free lunch theorems for search (Technical Report SFI-TR-95-02-010, Santa Fe Institute, 1995).
C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64, 107–115 (2021).
P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak and I. Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021, 124003 (2021).
A. Jacot, F. Gabriel and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, Vol. 31 (2018).
D. Rolnick, A. Veit, S. Belongie and N. Shavit. Deep learning is robust to massive label noise. ArXiv:1705.10694 (2017).
S. Garg, S. Balakrishnan, Z. Kolter and Z. Lipton. RATT: Leveraging unlabeled data to guarantee generalization. In: International Conference on Machine Learning (PMLR, 2021); pp. 3598–3609.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958 (2014).
C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation 7, 108–116 (1995).
D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology 148, 574–591 (1959).
D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology 160, 106–154 (1962).
D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology 195, 215–243 (1968).
D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A 4, 2379–2394 (1987).
I. Kuzovkin, R. Vicente, M. Petton, J.-P. Lachaux, M. Baciu, P. Kahane, S. Rheims, J. R. Vidal and J. Aru. Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications Biology 1, 1–12 (2018).
B. Alsallakh, N. Kokhlikyan, V. Miglani, J. Yuan and O. Reblitz-Richardson. Mind the PAD – CNNs can develop blind spots. ArXiv:2010.02178 (2020).
M. Lin, Q. Chen and S. Yan. Network in network. ArXiv:1312.4400 (2013).
C. Szegedy, S. Ioffe, V. Vanhoucke and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: 31st AAAI Conference on Artificial Intelligence (2017).
S. Xie, R. Girshick, P. Dollár, Z. Tu and K. He. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017); pp. 1492–1500.
M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience 2, 1019–1025 (1999).
K. Yamaguchi, K. Sakamoto, T. Akabane and Y. Fujimoto. A neural network for speaker-independent isolated word recognition. In: First International Conference on Spoken Language Processing (1990).
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541–551 (1989).
Y. Zhang, P. Sun, Y. Jiang, D. Yu, Z. Yuan, P. Luo, W. Liu and X. Wang. ByteTrack: Multi-object tracking by associating every detection box. ArXiv:2110.06864 (2021).
J. Long, E. Shelhamer and T. Darrell. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015); pp. 3431–3440.
J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. ArXiv:1804.02767 (2018).
L. A. Gatys, A. S. Ecker and M. Bethge. Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016); pp. 2414–2423.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al. An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021).
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021); pp. 10012–10022.
A. Krizhevsky, I. Sutskever and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012); pp. 1097–1105.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ArXiv:1409.1556 (2014).
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015); pp. 1–9.
K. He, X. Zhang, S. Ren and J. Sun. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016); pp. 770–778.
G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017); pp. 4700–4708.
B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez and K. Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018); pp. 9127–9135.
A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le and H. Adam. Searching for MobileNetV3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2019); pp. 1314–1324.
I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He and P. Dollár. Designing network design spaces. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020); pp. 10428–10436.
Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell and S. Xie. A ConvNet for the 2020s. ArXiv:2201.03545 (2022).
Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In: Proceedings of the International Conference on Machine Learning, Vol. 96 (Citeseer, 1996); pp. 148–156.
B. Taskar, C. Guestrin and D. Koller. Max-margin Markov networks. Advances in Neural Information Processing Systems 16, 25 (2004).
D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004).
H. Bay, T. Tuytelaars and L. Van Gool. SURF: Speeded up robust features. In: European Conference on Computer Vision (Springer, 2006); pp. 404–417.
J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the IEEE International Conference on Computer Vision, Vol. 3 (IEEE Computer Society, 2003); pp. 1470–1477.
R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision (Cambridge University Press, 2000).
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Vol. 4 (2010); pp. 249–256.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ArXiv:1412.6980 (2014).
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning (2010).
S. Boyd and L. Vandenberghe. Convex Optimization (Cambridge University Press, Cambridge, England, 2004).
R. I. Hartley and F. Kahl. Global optimization through rotation space search. International Journal of Computer Vision 82, 64–79 (2009).
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1 (IEEE, 2005); pp. 886–893.
B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
Q. V. Le. Building high-level features using large scale unsupervised learning. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2013); pp. 8595–8598.
G. A. Miller. WordNet: a lexical database for English. Communications of the ACM 38, 39–41 (1995).
A. Torralba, R. Fergus and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 1958–1970 (2008).
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252 (2015).
C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al. LAION-5B: An open large-scale dataset for training next generation image-text models. ArXiv:2210.08402 (2022).
R. Fernando. GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics (Addison-Wesley, 2004).
O. Russakovsky, J. Deng, Z. Huang, A. C. Berg and L. Fei-Fei. Detecting avocados to zucchinis: what have we done, and where are we going? In: International Conference on Computer Vision (ICCV) (2013).
A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin and A. A. Kalinin. Albumentations: Fast and flexible image augmentations. Information 11, 125 (2020).
C. Mead and L. Conway. Introduction to VLSI Systems (Addison-Wesley, 1980).
R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al. On the opportunities and risks of foundation models. ArXiv:2108.07258 (2021).
A. Lavin and S. Gray. Fast algorithms for convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016); pp. 4013–4021.
A. Goyal, A. Bochkovskiy, J. Deng and V. Koltun. Non-deep networks. ArXiv:2110.07641 (2021).
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna. Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016); pp. 2818–2826.
J. H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association 82, 249–266 (1987).
I. Guyon, S. Gunn, M. Nikravesh and L. A. Zadeh. Feature Extraction: Foundations and Applications (Springer, 2008).
V. Vapnik. The Nature of Statistical Learning Theory (Springer, New York, 1995).
A. B. J. Novikoff. On convergence proofs on perceptrons. In: Proceedings of the Symposium on the Mathematical Theory of Automata (Polytechnic Institute of Brooklyn, 1962); pp. 615–622.
J. L. Ba, J. R. Kiros and G. E. Hinton. Layer normalization. ArXiv:1607.06450 (2016).
J. Duchi, E. Hazan and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 2121–2159 (2011).
M. Zaheer, S. Reddi, D. Sachan, S. Kale and S. Kumar. Adaptive methods for nonconvex optimization. In: Advances in Neural Information Processing Systems (2018); pp. 9793–9803.
R. Anil, V. Gupta, T. Koren, K. Regan and Y. Singer. Scalable second-order optimization for deep learning. ArXiv:2002.09018 (2020).
M. Teye, H. Azizpour and K. Smith. Bayesian uncertainty estimation for batch normalized deep networks. ArXiv:1802.06455 (2018).
P. Luo, X. Wang, W. Shao and Z. Peng. Towards understanding regularization in batch normalization. ArXiv:1809.00846 (2018).
Z. C. Lipton and J. Steinhardt. Troubling trends in machine learning scholarship. Queue 17, 45–77 (2019).
S. Santurkar, D. Tsipras, A. Ilyas and A. Madry. How does batch normalization help optimization? In: Advances in Neural Information Processing Systems (2018); pp. 2483–2493.
H. Wang, A. Zhang, S. Zheng, X. Shi, M. Li and Z. Wang. Removing batch normalization boosts adversarial training. In: International Conference on Machine Learning (PMLR, 2022); pp. 23433–23445.
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems (W. H. Winston, 1977).
V. A. Morozov. Methods for Solving Incorrectly Posed Problems (Springer, 1984).
A. Prakash, S. A. Hasan, K. Lee, V. Datla, A. Qadir, J. Liu and O. Farri. Neural paraphrase generation with stacked residual LSTM networks. ArXiv:1610.03098 (2016).
J. Kim, M. El-Khamy and J. Lee. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. ArXiv:1701.03360 (2017).
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin. Attention is all you need. In: Advances in Neural Information Processing Systems (2017); pp. 5998–6008.
T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. ArXiv:1609.02907 (2016).
S. Ren, K. He, R. Girshick and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (2015); pp. 91–99.
R. K. Srivastava, K. Greff and J. Schmidhuber. Highway networks. ArXiv:1505.00387 (2015).
K. He, X. Zhang, S. Ren and J. Sun. Identity mappings in deep residual networks. In: European Conference on Computer Vision (Springer, 2016); pp. 630–645.
G. Pleiss, D. Chen, G. Huang, T. Li, L. Van Der Maaten and K. Q. Weinberger. Memory-efficient implementation of DenseNets. ArXiv:1707.06990 (2017).
J. Hu, L. Shen and G. Sun. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018); pp. 7132–7141.
B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. ArXiv:1611.01578 (2016).
H. Liu, K. Simonyan and Y. Yang. DARTS: Differentiable architecture search. ArXiv:1806.09055 (2018).
M. Tan and Q. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning (PMLR, 2019); pp. 6105–6114.
A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 855–868 (2008).
I. Sutskever, O. Vinyals and Q. V. Le. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (2014); pp. 3104–3112.
Z. C. Lipton, D. C. Kale, C. Elkan and R. Wetzel. Learning to diagnose with LSTM recurrent neural networks. In: International Conference on Learning Representations (ICLR) (2016).
Z. C. Lipton, J. Berkowitz and C. Elkan. A critical review of recurrent neural networks for sequence learning. ArXiv:1506.00019 (2015).
P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters and B. Schölkopf. Nonlinear causal discovery with additive noise models. In: Advances in Neural Information Processing Systems (2009); pp. 689–696.
J. Peters, D. Janzing and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms (MIT Press, 2017).
F. Wood, J. Gasthaus, C. Archambeau, L. James and Y. W. Teh. The sequence memoizer. Communications of the ACM 54, 91–98 (2011).
Y. Bengio, R. Ducharme, P. Vincent and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research 3, 1137–1155 (2003).
P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78, 1550–1560 (1990).
H. Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Vol. 31 (GMD-Forschungszentrum Informationstechnik Bonn, 2002).
C. Tallec and Y. Ollivier. Unbiasing truncated backpropagation through time. ArXiv:1705.08209 (2017).
J. L. Elman. Finding structure in time. Cognitive Science 14, 179–211 (1990).
Y. Bengio, P. Simard and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 157–166 (1994).
S. Hochreiter, Y. Bengio, P. Frasconi and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: A Field Guide to Dynamical Recurrent Neural Networks (IEEE Press, 2001).
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation 9, 1735–1780 (1997).
K. Cho, B. Van Merriënboer, D. Bahdanau and Y. Bengio. On the properties of neural machine translation: Encoder–decoder approaches. ArXiv:1409.1259 (2014).
J. Chung, C. Gulcehre, K. Cho and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv:1412.3555 (2014).
M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 2673–2681 (1997).
P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, R. L. Mercer and P. Roossin. A statistical approach to language translation. In: COLING Budapest 1988 Volume 1: International Conference on Computational Linguistics (1988).
P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. Lafferty, R. L. Mercer and P. S. Roossin. A statistical approach to machine translation. Computational Linguistics 16, 79–85 (1990).
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. ArXiv:1406.1078 (2014).
K. Papineni, S. Roukos, T. Ward and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (2002); pp. 311–318.
J. Devlin, M.-W. Chang, K. Lee and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv:1810.04805 (2018).
K. Clark, M.-T. Luong, Q. V. Le and C. D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In: International Conference on Learning Representations (2020).
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv:1907.11692 (2019).
I. Beltagy, M. E. Peters and A. Cohan. Longformer: The long-document transformer. ArXiv:2004.05150 (2020).
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019).
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020).
A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al. Conformer: Convolution-augmented transformer for speech recognition. Proc. Interspeech 2020, 5036–5040 (2020).
L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems 34, 15084–15097 (2021).
V. P. Dwivedi and X. Bresson. A generalization of transformer networks to graphs. ArXiv:2012.09699 (2020).
D. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate. ArXiv:1409.0473 (2014).
N. Kalchbrenner, E. Grefenstette and P. Blunsom. A convolutional neural network for modelling sentences. ArXiv:1404.2188 (2014).
Z. Yang, Z. Hu, Y. Deng, C. Dyer and A. Smola. Neural machine translation with recurrent attention modeling. ArXiv:1607.05108 (2016).
V. Mnih, N. Heess, A. Graves et al. Recurrent models of visual attention. In: Advances in Neural Information Processing Systems (2014); pp. 2204–2212.
E. A. Nadaraya. On estimating regression. Theory of Probability & its Applications 9, 141–142 (1964).
G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 359–372 (1964).
Y.-P. Mack and B. W. Silverman. Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 61, 405–415 (1982).
B. Silverman. Density Estimation for Statistics and Data Analysis (Chapman and Hall, 1986).
A. Norelli, M. Fumero, V. Maiorca, L. Moschella, E. Rodolà and F. Locatello. ASIF: Coupled data turns unimodal models to multimodal without training. ArXiv:2210.01738 (2022).
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper and B. Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. ArXiv:1909.08053 (2019).
A. Graves. Generating sequences with recurrent neural networks. ArXiv:1308.0850 (2013).
L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition (Prentice-Hall, 1993).
W. Chan, N. Jaitly, Q. V. Le and O. Vinyals. Listen, attend and spell. ArXiv:1508.01211 (2015).
Z. Lin, M. Feng, C. N. Santos, M. Yu, B. Xiang, B. Zhou and Y. Bengio. A structured self-attentive sentence embedding. ArXiv:1703.03130 (2017).
J. Cheng, L. Dong and M. Lapata. Long short-term memory-networks for machine reading. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (2016); pp. 551–561.
A. P. Parikh, O. Täckström, D. Das and J. Uszkoreit. A decomposable attention model for natural language inference. ArXiv:1606.01933 (2016).
R. Paulus, C. Xiong and R. Socher. A deep reinforced model for abstractive summarization. ArXiv:1705.04304 (2017).
P. Shaw, J. Uszkoreit and A. Vaswani. Self-attention with relative position representations. ArXiv:1803.02155 (2018).
C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu and D. Eck. Music transformer: generating music with long-term structure. In: International Conference on Learning Representations (2018).
N. Bodla, B. Singh, R. Chellappa and L. S. Davis. Soft-NMS – improving object detection with one line of code. In: Proceedings of the IEEE International Conference on Computer Vision (2017); pp. 5561–5569.
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. C. Berg. SSD: Single shot multibox detector. In: European Conference on Computer Vision (Springer, 2016); pp. 21–37.
V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. ArXiv:1603.07285 (2016).