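Reposted from Synced (机器之心); see the original post.
This conversation covers my experience in open source and some of my thoughts on the open-source community and on open-source deep learning projects.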
Thoughts on Open Source
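Synced: You have contributed to software projects such as TensorFlow, XGBoost, and MXNet, and you are also the author of Scikit Flow, ggfortify, and metric-learn. Could you pick a few favorites, introduce them, and tell us why you prefer them?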
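Yuan Tang: I like every project I have taken part in; I learned a great deal from each of them and met a great many people. Here I would like to briefly talk about Scikit Flow, which I have been working on recently and which now lives in TensorFlow.contrib as the TF.Learn module. Illia Polosukhin from Google and I started it together. Now that it sits inside TensorFlow, Google's TensorFlow team has begun to pay attention to the module and to take part in its development. Its purpose is to lower the barrier to distributed machine learning and deep learning, so that people can build their own machine learning and deep learning models as quickly as they would with Scikit-learn in Python. For example, a few lines of code are enough to use algorithms such as random forests and deep neural networks, and models can easily be deployed to a distributed cluster, so that users really benefit from TensorFlow's distributed capabilities. Implementing these features normally requires a deep understanding of the low-level TensorFlow API. Data science practitioners should not have to spend a lot of time learning those implementation details just to use the latest algorithms and techniques; they can apply them directly in their work and research. Lately I have leaned toward building this kind of simple, unified interface. Software like TensorFlow has its own particular syntax and usage, which forces people to spend time learning it, and I think good things should be simple to use. ggfortify was built with a similar motivation. At the time we had too much duplicated code implementing the same functionality over and over, for example drawing ellipses around the clusters found by a clustering algorithm, or visualizing time series analysis results produced by different R packages. What ggfortify does is visualize the results of commonly used R data analysis packages, so that users do not have to spend much time learning ggplot2's particular syntax to produce a common plot. I hope there will be more and more packages like this, so that researchers and practitioners can worry less about implementation details, concentrate on their main research, and make faster substantive breakthroughs in science and technology.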
Synced: Could you tell us what led you to become an active contributor to the open-source community?
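Yuan Tang: In my senior year of college I interned at a startup with a very open policy on open source. We used all kinds of open-source software, and along the way we ran into various problems and were dissatisfied with parts of the user experience. GitHub has a place to submit suggestions, but the project maintainers were too busy, so I simply studied the source code myself and submitted changes. That became a habit: when I hit a problem, my first reaction was to dig into the code and fix it myself, and that is how my ability to read code independently gradually developed. Because I knew certain open-source packages so well, I would often think of interesting ways to improve their performance or extend their functionality while working on projects.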
Synced: What makes you so passionate about the open-source community?
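Yuan Tang: I have received a lot of help through taking part in open-source software, learned a lot, and met many like-minded people, and I hope my contributions can repay the community for that help. I believe that the more you give, the more you get back. At first I was only patching the software I used regularly because my work required it, but I gradually got into the habit of reading source code. I became more and more curious about software architecture and kept looking into how a given feature was actually implemented, and that process benefited me greatly. For example, at the beginning I had taught myself the basics of Python but had no real project experience. Data science requires you to be very familiar with all kinds of open-source software, because you will not have time to implement everything you need yourself, and what you need has usually already been implemented in an open-source package. As I became more familiar with the software, I grew curious about the implementation details and started studying them, which gave me a solid foundation in programming and software architecture and a progressively deeper understanding of the language. Larger open-source projects such as pandas have many developers who maintain the project and review newly contributed code. They carefully reviewed every line I submitted and gave many good suggestions for improvement, which helped me build many good software development habits.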
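Synced: You are very active on GitHub, which is currently the most popular platform for open-source projects. It is widely used abroad, and some students in related majors even routinely keep the homework code from their courses on GitHub. But to students in China who have heard of open source and downloaded open-source packages but have never tried contributing to an open-source project, it is still fairly unfamiliar, and they may not even know the Git workflow well. Could you briefly explain to them how to take part in an open-source project on GitHub, and describe the culture of the open-source community?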
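Yuan Tang: Most open-source software is developed on GitHub, so people need to get familiar with Git and practice using it. On GitHub, discussion of an open-source project generally happens in its issues, where people come up with interesting ideas for improving the project's performance and extending its functionality. You can pick a project to contribute to based on your own interests and needs, fork it, make your improvements and changes, and then submit a pull request so the project maintainers can review the code. The first open-source software I contributed to was the Python library pandas. It has a particularly active contributor community and very patient, helpful maintainers. Every issue is labeled with its difficulty and with whether it is suitable for first-time contributors, so you can start with an easy issue and get familiar with the contribution process. My first code contribution took a week to be accepted. The maintainers carefully reviewed and commented on every line I wrote, making sure my change would not affect existing users and requiring that new code come with new unit tests. I developed many good development habits in the process and made friends around the world, and I hope these suggestions are helpful.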
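Synced: You lead a team at Uptake and are also an active contributor to the open-source community. From the two perspectives of a company and a community member, what are the advantages and disadvantages of open-sourcing a project?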
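Yuan Tang: Taking part in these open-source projects, I have come to appreciate deeply how active and innovative this community is. Take XGBoost as an example. Once someone achieves a good ranking in a competition using XGBoost, the project gains recognition, more people are willing to try it, and some even use it in their company's core software. As a result, the broader community tests the product thoroughly, and bug reports and feature requests help us understand users' needs and improve it. Almost all of the conversation in this process is open and transparent, and anyone can take part. Sometimes a small coding mistake is spotted by the wider community and the software's performance suddenly improves severalfold. For a company, open-sourcing code also means that everyone knows which algorithms it uses, and other companies can imitate them, which increases competition. Because the code is open, there can often be security risks as well, although I believe companies understand these risks thoroughly and have ways to deal with them before they open-source their code.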
Synced: What kinds of projects generally get open-sourced? What does it mean for a project to be open-sourced?
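Yuan Tang: There are many situations. For example, a company may suddenly decide to open-source code it has used for years. That may be because some of the people on the project have left, or because the algorithm is outdated and no longer needs to be kept secret; the company decides to stop spending time on maintenance and hands the responsibility for maintaining and testing it to the community of users. It may also be a very new project that is open-sourced to reduce the company's own maintenance cost, so that it can be tested more thoroughly and user feedback arrives early, resulting in a product that better meets users' needs. Some companies open-source strategically, to show that they are experts in a particular area, and so on. Different companies have different ways of maintaining open-source software, different policies, and different reasons. Once a project is open-sourced, all of its code, algorithms, and implementation details are transparent. This means every competitor knows the technology the company uses and how particular details are implemented, which carries some security risk, but I believe this is all thought through before open-sourcing. For employees, having a project open-sourced means that the years of effort and contributions they put in at the company can be seen by everyone, which also makes their work more motivating.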
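Synced: Many people do not understand the idea of open source. They think open source simply means free and that it conflicts with commercial interests. How do you see this?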
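Yuan Tang: Open-sourcing is usually a strategic decision and usually does not conflict with commercial interests. As I mentioned earlier, open-sourcing a project can save a good deal of its cost, and it can also reduce what the company spends on recruiting suitable people, because it can hire employees who are already familiar with the technology the company uses. When they join, no further time and effort need to be spent on training.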
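Synced: Others believe that companies such as Google, Facebook, and Microsoft open-source their work in order to monopolize the industry. Do you think these concerns are warranted? How should we respond?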
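Yuan Tang: I do not think this is worth worrying about. Technology turns over too quickly. A technology or open-source project that is popular today may stop being maintained or lose its competitiveness tomorrow for one reason or another. Take DeepLearning4J, which used to be fairly popular: with competitors such as MXNet's Scala interface, which offers better performance and is maintained by DMLC members who understand users' needs better, I believe it can hardly remain competitive now. We should not worry about these things. As I touched on earlier, people like to compare frameworks and keep learning from the differences in order to make their own products more competitive, and that kind of competition is healthy. What we can do is use the software we like to meet the needs of our research and work, and keep up with the times.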
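Synced: Organizations such as Google, Facebook, and OpenAI have open-sourced a great deal, and so have you. What role do you think open source plays in the development of artificial intelligence technology and its community?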
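Yuan Tang: Open source makes the results of artificial intelligence research more reproducible and makes it easier for researchers to share them. Take TF.Learn, which I mentioned earlier: a recent Google paper was even implemented with it, and the algorithm's implementation became an Estimator in TF.Learn. Practitioners in industry can therefore use it directly in their own work and research, and the paper's results can easily be reproduced. The open-source projects from these companies give everyone more learning resources and better tools for study, work, and research. They also give people around the world the chance to communicate, learn from each other, and build products together, which I find especially valuable. In Chicago we often hold in-person meetups to exchange ideas, which greatly broadens my horizons and has introduced me to a group of like-minded friends.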
Synced: Google gave you the Open Source Peer Award. Could you tell us what this award means to you? What do you plan to do next?
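Yuan Tang: Candidates for this award are first nominated by Google employees, and the final list of recipients is decided after internal review and discussion. My sustained contributions earned their attention and approval, which is a great encouragement to me and a recognition of my work. First, I will stay active in this community, maintaining and continuing to contribute to the software I am involved in and helping people with the problems they run into when using it. I often come across interesting ideas on GitHub Issues and StackOverflow, or find that someone's question or answer sparks new inspiration. I also keep an eye on what this field needs; there is actually a lot that can be done. I am most interested in projects that make people's work more efficient and less repetitive and tedious.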
About DMLC
Synced: In your eyes, what kind of organization is DMLC? What led you to join it?
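Yuan Tang: Simply put, DMLC exists to make it easier for people to use the latest algorithms and techniques and to lower the barrier to entering the field. We want to bring state-of-the-art techniques to everyone, so that interested people do not have to spend time reimplementing them, can apply them directly in their research and work, and can concentrate on making breakthroughs on top of existing techniques. We think good things should be shared, because that makes everyone more efficient and speeds up breakthroughs in research. I started out by improving and extending DMLC's XGBoost project, for example by adding many small features to the Python package. Quite a few of those requests came from users of Kaggle, the currently popular data science competition platform, and some DMLC members are regulars on the Kaggle forums, helping people use XGBoost for all kinds of creative modeling needs. At Tianqi's invitation I became an XGBoost committer and naturally began spending more time maintaining the project. Later I also took part in building MXNet's Scala interface.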
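Synced: Someone on Zhihu said that DMLC's original intention seems to have been to provide a simple, easy-to-use Python interface. Is that right? And what was your original intention in designing MXNet?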
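Yuan Tang: I think a Python interface is just a good starting point for getting good technology to more people. Python is very popular in data science and machine learning, has a large user base and an active open-source community, and is a good choice. Most deep learning researchers are Python users, especially in areas such as image processing and text processing. But a Python interface alone cannot meet every need. R and Julia are common in many social science and life science fields, and the languages most widely used in industry are JVM languages such as Java and Scala. That is why we later put a lot of effort into other languages, such as the Scala interface, on which I have spent the most time. I do not think a good product can deliver every language and every requirement at once. We first experimented with the Python interface to see which user needs had to be met, whether our approach would work, and whether people would like it. The other interfaces followed demand as well, which lets us use our time more effectively.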
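Synced: As deep learning receives more and more attention, comparisons of the major frameworks have become increasingly common. Is this a good or a bad thing?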
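Yuan Tang: I think it is a very good thing. Competition drives progress, and when everyone open-sources their own distinctive framework, it helps learning, research, and communication. NNVM, built by Tianqi at DMLC, is well worth studying. It is a very lightweight module that implements needs found in deep learning systems such as TensorFlow and MXNet, including optimizing computation graphs and handling front ends and back ends. I am very happy to see people so willing to open-source their research, because it spares later learners many detours. Comparison is good, because it lets people see the strengths and weaknesses of their own frameworks. At the same time, because conditions vary and change, it is often hard to make a reasonably fair comparison. Some people like to tailor the benchmark to what their own framework is good at. I hope people will compare comprehensively rather than compare frameworks in order to win users.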
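Synced: Google and Facebook are large companies, and their TensorFlow and Torch7 seem to be more popular. Is that the result of their resource advantage? How do you view the voices in the industry against TensorFlow and those in support of it?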
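Yuan Tang: A resource advantage does affect popularity, but not absolutely, because popularity ultimately has to stand the test of time and the final decision rests with users. User groups with different needs can test the different frameworks thoroughly, and whether a framework is backed by a large company or comes from the grassroots, it has a chance to compete; the field will not be completely monopolized. I do not think people need to worry much about this. Understand the needs of your own project, weigh the learning cost, and see which framework suits you best. For example, many people would rather not learn a new programming language just to use Torch7, and many statisticians who know R can use MXNet's R interface directly, and so on.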
About Synced (机器之心)
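Synced (机器之心), founded in March 2014, is a leading Chinese vertical media outlet for artificial intelligence. It covers frontier technologies such as artificial intelligence, robotics, and cognitive neuroscience, and provides professional, authoritative information that helps readers keep up with technology trends. While focusing on technical content, Synced also attends to deeper reflection on the relationship between people and technology, encouraging readers to imagine how technological development connects with human progress and helping to build a healthy environment for innovation.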
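Synced is recognized in the industry as the most professional Chinese-language new media outlet focused on artificial intelligence. As an industry-facing vertical media outlet on AI, it has broken through the limits on influence that vertical trade media usually face and has risen quickly. Synced now publishes on several large content platforms, including WeChat, Toutiao, Baidu Baijia, and the Tencent content open platform, and runs its own official website. It has more than 110,000 WeChat subscribers, and its combined daily views on other content platforms such as Baidu and Toutiao exceed 30,000.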
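In 2015, Synced received the Huxiu "Top Ten Authors of 2015" and Toutiao "Best Self-Media of 2015" awards, served as chief technology media for the APEC GIC Global VR Summit, and was the only Chinese media outlet invited to cover The Brain Forum in Lausanne, Switzerland. In 2016, Synced was the only official partner from China of the WithTheBest AI Conference.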
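Synced is also part of the 联想之星 (Legend Star) Comet Labs global resource platform and works with Comet Labs to provide industry and startup acceleration services to participants in the artificial intelligence field.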
Copyright © Yuan Tang 2024
This interview transcript, translated by Wenxuan Li, is cross-posted from www.jiqizhixin.com. The original Chinese post can be found here.
In this interview, Yuan Tang talks about his experience with open source and shares his thoughts on the open-source community and on open-source deep learning projects.
Thoughts on Open-source
Synced: We understand that you have taken part in TensorFlow, XGBoost, MXNet, and other open-source software projects, and that you are also the author of Scikit Flow, ggfortify, and metric-learn. Could you introduce some of your favorite projects and tell us why you like them the most?
Yuan Tang: Actually, I am a big fan of all the projects I have been part of. I learned a lot from them and met many interesting people along the way. Today I would like to talk about Scikit Flow, a project I have been actively spending time on. It is now the TF.Learn module in TensorFlow.contrib. Illia Polosukhin from Google and I started the project together. Since it became part of TensorFlow, the TensorFlow team has invested in it, used it, and enhanced it, incorporating it into many of Google's research projects and products. The purpose of the project is to lower the barrier to using TensorFlow and deep learning, so that people can build their own machine learning and deep learning models as quickly as they would with Scikit-learn in Python. For instance, users can apply random forests, deep neural networks, and other algorithms with only a few lines of code and can easily distribute the work across multiple devices and clusters, which lets them take real advantage of TensorFlow's distributed support. Without such a module, this would be possible only for users with a thorough understanding of the lower-level TensorFlow APIs. I do not think people working in data science need to spend time on those details; they should be able to apply these tools quickly in their research and work. Lately I have been drawn to building simple, unified interfaces of this kind. Software like TensorFlow has its own syntax and usage, which users have to spend time learning. In my opinion, good tools should be simple and easy to use. ggfortify was created for similar reasons, to give users quick and easy access to R's plotting functionality. We used to write repetitive code to do the same things again and again, such as drawing ellipses around clustering results or visualizing time series analysis results from different R packages.
What ggfortify offers is a quick way to visualize the results of popular R packages, so that users do not have to spend time learning ggplot2's particular syntax just to produce a common visualization. I hope to see more and more packages like ggfortify, so that users can skip the complex learning process and focus on the core of their research and work, leading to more substantial breakthroughs.
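To make the "few lines of code" idea concrete, here is a minimal sketch in plain Python of the Scikit-learn-style fit/predict convention that Scikit Flow mirrors. The `NearestCentroidClassifier` class and the toy data are hypothetical illustrations and are not part of TF.Learn or Scikit-learn; a real estimator would hide a TensorFlow model behind the same two methods.

```python
from collections import defaultdict
from math import dist


class NearestCentroidClassifier:
    """Hypothetical toy estimator exposing the fit/predict convention."""

    def fit(self, X, y):
        # Group training points by label, then average each group.
        groups = defaultdict(list)
        for point, label in zip(X, y):
            groups[label].append(point)
        self.centroids_ = {
            label: tuple(sum(col) / len(points) for col in zip(*points))
            for label, points in groups.items()
        }
        return self  # returning self allows chained calls

    def predict(self, X):
        # Assign each point the label of its closest centroid.
        return [
            min(self.centroids_,
                key=lambda label: dist(point, self.centroids_[label]))
            for point in X
        ]


# The user-facing part is only a few lines.
X_train = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
y_train = ["a", "a", "b", "b"]
model = NearestCentroidClassifier().fit(X_train, y_train)
predictions = model.predict([(0.5, 0.5), (5.5, 4.0)])
```

The point of the design is that the model's internals can change completely while the calling code stays the same.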
Synced: Could you talk about what led you to become an active contributor to the open-source community?
Yuan Tang: When I was a senior in college, I interned at a startup that had a very open policy on open source. We used all kinds of open-source software, and along the way I noticed various problems and things users were unhappy with. Although issues and bug reports could be submitted on GitHub, the project maintainers were too busy to deal with them all. So I learned to study the source code and submit fixes myself. Later it became second nature: when I came across a problem, I would dig in and solve it myself instead of just filing a report and leaving it to someone else. Through that process I practiced and improved my code-reading skills. From time to time, I also came up with interesting ideas for improving the projects.
Synced: What makes you so passionate about the open-source community?
Yuan Tang: Since I became a contributor, I have received help and guidance from others and learned a lot. I have also met people who share my goals. In return, I hope to give back to the community. I believe that the more I contribute, the more I will receive. At first I fixed code only because my job required it. Over time, though, I got into the habit of tracing problems back to the source code. As a result, I became more interested in software architecture and in how particular features are implemented. The whole process has proved very beneficial. For example, when I first learned Python, I had no real project experience. People who work in data science need to be very familiar with all sorts of open-source software, because it already provides most of the functionality they need. As I studied the implementation details of these projects, I learned more about the programming languages and technologies behind them. The maintainers of larger open-source projects constantly review newly contributed code. They carefully read every line I submitted and gave insightful comments, which encouraged me to write better code.
Synced: You are very active on GitHub, which is the most popular platform for hosting open-source projects. Students in related fields even keep their homework and projects on GitHub. However, GitHub is still rather new to people in China who have heard of open source and downloaded open-source packages but have never tried to contribute to an open-source project. Could you talk about how open-source projects on GitHub work and what open-source culture is like?
Yuan Tang: Most open-source projects are developed on GitHub, so everyone should get some hands-on experience with Git and GitHub. Discussion usually happens in a project's issues, where people come up with interesting ideas for improving and extending the project. You can choose a project that matches your interests and needs, fork it, make your changes, and then submit a pull request for the project maintainers to review. Python's pandas, which has an active open-source community and helpful maintainers, was the very first open-source project I contributed to. Every issue is labeled with its difficulty level and with whether first-time contributors are welcome, so contributors can start with easy tasks and move on to more difficult ones. It took a week for my first contribution to be accepted. The maintainers carefully read my code and gave detailed comments, making sure that my changes would not affect current users and that new code came with new unit tests. As I mentioned before, I learned a lot from the process and made many friends who share my interests and goals.
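As a hedged illustration of what reviewers tend to look for in a first contribution, the sketch below shows the shape of a small change that leaves existing behavior intact by default and arrives with tests. The function `normalize_labels` and its `lowercase` flag are hypothetical and are not taken from pandas.

```python
import unittest


def normalize_labels(labels, lowercase=False):
    """Strip whitespace from labels; optionally lowercase them.

    `lowercase` stands in for a newly contributed feature. It defaults to
    False so that callers written before the change see the old behavior.
    """
    cleaned = [label.strip() for label in labels]
    if lowercase:
        cleaned = [label.lower() for label in cleaned]
    return cleaned


class NormalizeLabelsTest(unittest.TestCase):
    def test_default_behavior_is_unchanged(self):
        # Guards existing users: no flag means no lowercasing.
        self.assertEqual(normalize_labels([" Cat ", "DOG"]), ["Cat", "DOG"])

    def test_lowercase_option(self):
        # Covers the newly added code path.
        self.assertEqual(
            normalize_labels([" Cat ", "DOG"], lowercase=True),
            ["cat", "dog"],
        )
```

Saved to a file, the tests can be run with `python -m unittest`. The first test protects current users, and the second documents the new behavior for the reviewer.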
Synced: You are a team lead at Uptake and also an active contributor to the open-source community. Could you talk about the advantages and disadvantages of open-sourcing a project from the perspective of both roles?
Yuan Tang: Working on these open-source projects, I have felt the passion and creativity of the community deeply. Take XGBoost as an example: when someone uses XGBoost to achieve a good ranking in a competition, the project becomes known and recognized, and more people are willing to try it. Some of them even apply it in their companies' core software. In that way, the project gets tested by more users, and bug reports and feature requests help us understand users' needs and improve it. Almost all of the conversation in this process is transparent and open to the public. One advantage is that even a minor mistake can be found and corrected by someone in the community, and such a fix sometimes improves the software's performance substantially. For a company, open-sourcing a project means that everyone knows which algorithms it is using, which can mean more competition from other companies. Information security can be another issue. However, I believe that companies that decide to open-source a project have already prepared for the security issues and have solutions for the potential problems.
Synced: What kinds of projects usually become open source? What does it mean for a project to be open-sourced?
Yuan Tang: It depends on the situation. In one case, a company decides to publish code it has been developing for years. The reason could be that the code is outdated or that the employees who worked on it have left. There is then no need to keep the code secret or to spend more resources on maintaining and testing it, so the company lets the open-source community take over. In another case, a brand-new project is open-sourced because the company wants to lower its costs and receive more comprehensive feedback from the public. A company may also open-source a project as a strategic move, to show competitors and users that it is an expert in a certain field. Different companies have different maintenance strategies and purposes. Once a project becomes open source, all of the associated code, algorithms, and implementation details are completely transparent. That also means competitors learn more about the technology of the company that created the project. For the developers whose projects become open source, it means that many other people get to see their efforts and contributions.
Synced: Some people think that open source means giving work away for free, which conflicts with running a profitable business. What do you think?
Yuan Tang: For a company, open-sourcing is usually a strategic decision, and it does not conflict with profitability. As I mentioned earlier, an open-source project helps the company save on maintenance and on recruiting, since it helps the company identify people who are capable of improving the project. When those people join the company, less has to be spent on training.
Synced: Others think that companies like Google and Facebook use open-source projects to monopolize the industry. Do you think this is something we should worry about? How should we respond?
Yuan Tang: I don't think this is something we need to worry about. Technology changes rapidly, and open-source projects that are popular today might not get as much attention tomorrow. DeepLearning4J is an example. Because of competitors such as MXNet's Scala binding, which performs better and is maintained by DMLC team members who understand users' needs well, DeepLearning4J is gradually losing its competitiveness. Competition drives projects to learn from one another and make their own products better, so I would not worry about monopoly. What we can do is use the software we like to meet the needs of our research or work while keeping up with progress in the field.
Synced: Google, Facebook, and OpenAI have created many open-source projects, and the company you work at is no exception. In your opinion, what impact does open source have on AI technology and its community?
Yuan Tang: Open source not only helps researchers share their progress and results but also makes AI research more reproducible. Take TF.Learn as an example: Google recently published a paper presenting a novel algorithm that was implemented with it, and the implementation was included in TF.Learn as one of its Estimators. As a result, people in industry can apply it directly in their own research and work, and the results in the paper can be reproduced easily. Open-source projects from the companies you mentioned provide more learning resources and better tools for work, research, and study. These projects also help people across the world communicate and learn from one another.
Synced: You recently received the Open Source Peer Award from Google. Could you talk about this award and what it means to you? What are your plans after receiving it?
Yuan Tang: I was nominated for this award by Google employees who knew the work I had done on the project, and I was then lucky enough to remain on the list of recipients after review and discussion by the management team at the Google Open Source Office. The award reflects Google's recognition of the contributions I have made, and receiving it is very encouraging. My plan is to keep contributing to this community, including maintaining projects and helping others solve the problems they encounter. I often discover interesting ideas on GitHub Issues and StackOverflow, and they inspire new ideas of my own. I also think there is a lot more we can do for the open-source community. I am personally most interested in projects that make people's work more efficient and less repetitive.
About DMLC
Synced: In your opinion, what is DMLC? What led you to DMLC?
Yuan Tang: In a few words, DMLC was created to make it easier for people to apply new algorithms and technology in their own research and work. We want to save those who are interested the time of reimplementing these techniques, so that they can apply them directly. That way, they can focus on breakthroughs in their own work instead of on reimplementing existing algorithms and techniques. We think that good things should be shared with everyone. At the beginning, I worked on improving and extending DMLC's XGBoost project, including its Python package. Many of the feature requests came from users in Kaggle data science competitions. Some DMLC members are active in the Kaggle forums, helping XGBoost users with their modeling needs. Later on I worked on the creation of MXNet's Scala binding.
Synced: Someone on Zhihu mentioned that DMLC was created to provide an easy-to-use Python API. Do you agree? If not, what do you think was the main purpose of creating DMLC?
Yuan Tang: I think a Python API is a good starting point for making good technology available to more people. Python is very popular in data science and machine learning, and it has many users and a very active open-source community. Many deep learning researchers are Python users, especially those who work on image processing and text processing. However, a Python API does not meet every user's needs. Many people in the social sciences and life sciences use R or Julia, while industry mostly uses JVM languages such as Java and Scala. That is why we later spent time on bindings for other languages, such as the Scala API I have focused on. I think it is very difficult for one product to support many programming languages from the start. We used the Python API as an experiment to see what users needed and whether they would like our approach.
Synced: As deep learning becomes more popular, people tend to compare the frameworks with one another. Do you think this is a good sign?
Yuan Tang: I think this is a positive thing because competition drives improvement. When everyone open-sources their own framework, we can all learn, research, and communicate better. Tianqi Chen from DMLC created NNVM, a valuable, lightweight module that implements much of the computation graph construction and optimization commonly used in deep learning systems such as TensorFlow and MXNet. I am very happy to see so many people contribute their research projects to the open-source community. It saves a lot of time for people in research and industry. Comparisons are also useful because they let people see the strengths and weaknesses of their own frameworks. At the same time, it is very challenging to make an entirely fair comparison, because conditions, prerequisites, and resources differ and the environment keeps changing. I hope we will strive for fair comparisons instead of comparing only to attract more users.
Synced: TensorFlow and Torch7, which are backed by influential companies such as Google and Facebook, seem to be more popular than comparable projects. Do you think this is a result of their resources? What do you think about the voices for and against TensorFlow?
Yuan Tang: Resources do affect the popularity of open-source projects, but that is not the end of the road for projects with fewer resources. I think time tests every open-source project, and those that are truly good survive. Users with different backgrounds and needs test the different frameworks thoroughly, and every framework stands a chance of being discovered and welcomed by users. Instead of worrying about resources, I think we can focus on understanding our own needs, considering the cost of learning, and finding the framework that fits us best. Some people do not want to learn a new programming language just to use Torch7, and some statisticians who are familiar with R can use MXNet's R interface directly.
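To give a feel for what optimizing a computation graph can mean, here is a minimal sketch in plain Python of one classic pass, constant folding. The `Node` class and the `fold_constants` and `evaluate` functions are hypothetical and do not reflect NNVM's actual API or data structures.

```python
import operator
from dataclasses import dataclass
from typing import Optional, Tuple

OPS = {"add": operator.add, "mul": operator.mul}


@dataclass(frozen=True)
class Node:
    """Hypothetical graph node: a constant, a named variable, or an op."""
    op: str                          # "const", "var", "add", or "mul"
    inputs: Tuple["Node", ...] = ()
    value: Optional[float] = None    # set only for "const"
    name: Optional[str] = None       # set only for "var"


def const(v):
    return Node("const", value=v)


def var(n):
    return Node("var", name=n)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def fold_constants(node):
    """Replace each op whose inputs are all constants with one constant."""
    if node.op in ("const", "var"):
        return node
    folded = tuple(fold_constants(i) for i in node.inputs)
    if all(i.op == "const" for i in folded):
        return const(OPS[node.op](*(i.value for i in folded)))
    return Node(node.op, folded)


def evaluate(node, env):
    """Evaluate a graph given values for its variables."""
    if node.op == "const":
        return node.value
    if node.op == "var":
        return env[node.name]
    return OPS[node.op](*(evaluate(i, env) for i in node.inputs))


# x * (2 + 3) is rewritten to x * 5, so the addition is done once, ahead of time.
graph = mul(var("x"), add(const(2.0), const(3.0)))
optimized = fold_constants(graph)
```

The optimized graph computes the same result as the original for any value of `x` but contains one fewer operation; production systems apply many such passes before running a model.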
About Synced (www.jiqizhixin.com)
Synced (www.jiqizhixin.com) was founded in March 2014. It is the first Chinese media company that focuses on the news and insights of cutting-edge technologies. As the industry’s most influential media leader, we provide professional, authoritative, and inspiring content in the fields of Artificial Intelligence, robotics, neuroscience and bio-medicine to our readers. Our readers are those who wish to learn more about the cutting-edge technologies and the upcoming technological revolution. Our vision goes beyond merely inspiring the readers’ imagination and creativity; it is to help them gaze into the future interactions between mankind and technologies.
As of the spring of 2016, over 100,000 readers have subscribed to Synced's WeChat channel, hosted by the largest Chinese social networking and content sharing platform. Daily visits average 22,000. On top of that, we have a combined 30,000 daily page views on other internet platforms, including Baidu and Tou Tiao.com. The feedback collected from our subscribers and third parties indicates that our topic selection, quality control, professionalism, practicality, and thoughtfulness are recognized and praised by executives of major Chinese technology companies, as well as by many experts, scholars, investors, industry practitioners, students, and technology enthusiasts. As a key content provider to Tencent, Baidu, and Tou Tiao, we have won the titles of Huxiu's Top Ten Content Provider of 2015 and Tou Tiao's Best Vertical Media of 2015. We have also served as the APEC Global VR Summit's chief technology media.
Synced has conducted in-depth interviews with numerous domestic and international experts, such as Deng Li (Microsoft Research’s Chief Artificial Intelligence Scientist), Richard Sutton (University of Alberta, RLAI), Hsiao-Wuen Hon (President of MSRA), Randy Schekman (Nobel Prize Laureate), and John Markoff (New York Times senior technology reporter and the author of Machines of Loving Grace). We have also covered IBM, Microsoft, Baidu, as well as many other outstanding AI and robotics companies and start-ups in China and around the world.