What changed in the Big data landscape from 2013 to 2019
Author: Abbass Marouni
Translator: helight
Original article: https://blog.marouni.fr/bidata-trends-analysis/
Translator's preface
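I came across this article online and found it quite interesting. It is also fairly short, so I contacted the author and said I would like to translate it into Chinese, not for commercial use but as practice and to help spread technical knowledge. The author replied very quickly, that same evening, which is how this translation came about. If anything in the translation is inaccurate, please point it out.
Background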
I’ve been a loyal follower of the Data Eng Weekly newsletter (formerly Hadoop Weekly) for the past 6 years. The newsletter is a great source for everything related to Big data and data engineering in general, with a wide selection of technical articles along with product announcements and industry news.
For this year’s holiday side project, I decided to analyze Data Eng’s archives, which go back to January 2013, to identify Big data trends and changes over the past 6 years.
So I crawled and cleaned over 290 weekly issues (well, Python did!), keeping only the article snippets from the technical, news and releases sections. Next, I ran some basic natural language processing followed by some basic filtering to produce the keyword mentions and all of the plots that follow.
Major trends over the last seven years
Let’s start with the major trends over the last seven years. Here I’m plotting the monthly rolling mean of the number of mentions of specific keywords, with the keywords being compared drawn on the same graph. The following plots show the approximate time frames in which technologies became more popular relative to one another (as reflected by more reporting about them).
Hadoop vs. Spark

Hadoop vs. Kafka

Hadoop vs. Kubernetes
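The comparison plots in this section rest on two steps: counting keyword mentions per issue and smoothing those counts with a rolling mean. The following is a minimal sketch of that idea, not the author's actual code. The sample `ISSUES` data, the function names, and the choice of a 4-issue window as an approximation of one month are all assumptions made for illustration.

```python
import re
from collections import deque

# Hypothetical sample: (issue date, concatenated article snippets) per weekly issue.
ISSUES = [
    ("2013-01-06", "Hadoop 1.0 tips. Running Hadoop on EC2."),
    ("2013-01-13", "Hadoop and Hive news."),
    ("2013-01-20", "Spark 0.6 released. Hadoop security."),
    ("2013-01-27", "Spark streaming intro."),
    ("2013-02-03", "Spark, Spark and more Spark."),
]


def count_mentions(text, keyword):
    """Count case-insensitive whole-word occurrences of keyword in text."""
    return len(re.findall(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE))


def rolling_mean(values, window):
    """Trailing rolling mean; early points average over the values seen so far."""
    recent = deque(maxlen=window)
    means = []
    for value in values:
        recent.append(value)
        means.append(sum(recent) / len(recent))
    return means


def keyword_trend(issues, keyword, window=4):
    """Return (date, smoothed mentions) pairs for one keyword, oldest issue first.

    Assumption: a window of 4 weekly issues stands in for "monthly".
    """
    ordered = sorted(issues)  # ISO dates sort chronologically as strings
    dates = [date for date, _ in ordered]
    counts = [count_mentions(text, keyword) for _, text in ordered]
    return list(zip(dates, rolling_mean(counts, window)))


hadoop_trend = keyword_trend(ISSUES, "Hadoop")
spark_trend = keyword_trend(ISSUES, "Spark")
```

Plotting `hadoop_trend` and `spark_trend` on the same axes would give a comparison in the spirit of the graphs above; any plotting library would do for that step.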

Yearly top keywords
Here I’m simply plotting the top 10 keywords by total number of mentions in a given year.
2013 : Hadoop’s golden year !

2014 : The rise of Spark !

2015 : Here comes Kafka !

2016 : Streaming is on fire !

2017 : Stream everything !

2018 : Back to basics !

2019 : …
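The yearly rankings in this section amount to grouping keyword mentions by year and keeping the most frequent ones. A minimal sketch of that step follows, assuming the filtering stage yields one `(year, keyword)` pair per mention. The `MENTIONS` sample and the function name are hypothetical and are not taken from the author's code.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (year, keyword) pair per mention that survived filtering.
MENTIONS = [
    (2013, "hadoop"), (2013, "hadoop"), (2013, "hive"),
    (2014, "spark"), (2014, "spark"), (2014, "hadoop"),
]


def top_keywords_by_year(mentions, n=10):
    """Map each year to its n most-mentioned keywords with their counts."""
    per_year = defaultdict(Counter)
    for year, keyword in mentions:
        per_year[year][keyword] += 1
    # Counter.most_common returns (keyword, count) pairs, highest count first.
    return {year: counts.most_common(n) for year, counts in sorted(per_year.items())}


top = top_keywords_by_year(MENTIONS)
```

Each value in `top` is a list of `(keyword, count)` pairs that could be fed directly to a bar chart, one chart per year.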

Code and dataset
I’m working on cleaning up the code so that you can generate the dataset yourself. I’ll also be posting the NLP Python snippets along with the Bokeh and Seaborn plot-generating snippets, so stay tuned.
Copyright notice
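This article represents only the author's views and does not represent the position of this site.
This article is published with the author's authorization and may not be reproduced without permission.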