Spark was initially created to improve on the MapReduce model, so existing MapReduce developers should definitely give Spark a try! When compared to MapReduce, Spark offers a higher-level, more expressive API in addition to a rich set of built-in and community libraries. To draw an analogy, if MapReduce is like an assembly language, i.e. low-level and imperative, Spark is more like a modern programming language with libraries and packages. Spark also provides significant performance improvements over MapReduce.
Spark can run in many different environments, ranging from co-existing with Hadoop deployments, to running in a Mesos cluster, and also in a managed service such as Databricks Cloud. In Hadoop environments, YARN is the cluster manager that helps launch and schedule the distributed components of a running Spark application. YARN can multiplex both Spark and MapReduce workloads on the same cluster hardware.
Today there are many more Java and Python users of Spark than Scala users, so no knowledge of Scala is necessary. Spark’s programmatic shell is provided in both Python and Scala (Java doesn’t have an interactive shell, so we don’t offer that feature for Java). Spark’s SQL features are available from all languages. For those wanting to try something new, the Scala API is always available.
Being able to expose Spark datasets over JDBC/ODBC is one of the most popular features we’ve provided in the last year. These interfaces allow querying Spark data with traditional BI and visualization tools as well as integrating with third-party applications. With a single program, Spark allows you to ETL your data from whatever format it is currently in (JSON, Parquet, a database), transform it, and expose it for ad-hoc querying. This is one of the most powerful concepts in Spark: a unification of what used to take many separate tools.
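As a rough sketch of that single-program ETL flow (my own illustration, not from the article; it assumes a Spark 1.4+ SQLContext and uses placeholder paths and column names), one program can read JSON, reshape it, write Parquet, and register the result for ad-hoc SQL:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("etl-sketch"))
    val sqlContext = new SQLContext(sc)

    // Read raw JSON events (path is a placeholder, not from the article).
    val events = sqlContext.read.json("hdfs:///data/events.json")

    // Transform: keep a few columns and drop bad records.
    val cleaned = events
      .select("user", "page", "ts")
      .filter("user IS NOT NULL")

    // Persist in a columnar format and expose the data for ad-hoc SQL.
    cleaned.write.parquet("hdfs:///warehouse/events_parquet")
    cleaned.registerTempTable("events")
    sqlContext.sql("SELECT page, COUNT(*) AS hits FROM events GROUP BY page").show()

    sc.stop()
  }
}
```

Serving the same table to BI tools over JDBC/ODBC would then go through the Spark SQL Thrift server rather than a separate system.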
While memory available in modern clusters is skyrocketing, there are always cases where data just won’t fit in memory. In all modern versions of Spark, most operations that exceed available memory will spill over to disk, meaning users need not worry about memory limits. As an example, Spark’s win of the Jim Gray sort benchmark occurred on a data set many times larger than could fit in cluster memory, and even with this Spark’s efficiency was several multiples higher than other widely used systems.
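As a small illustration of working with data larger than memory (again my own sketch under Spark 1.x assumptions, with a hypothetical input path): shuffle operations spill to local disk automatically, and cached data can be given a storage level that falls back to disk instead of failing the job:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spill-sketch"))

    // Hypothetical large input; the path is a placeholder.
    val lines = sc.textFile("hdfs:///data/huge_dataset")

    // MEMORY_AND_DISK keeps partitions in memory when they fit and
    // spills the remainder to local disk instead of throwing an OOM.
    val cached = lines.persist(StorageLevel.MEMORY_AND_DISK)

    println(cached.count())
    sc.stop()
  }
}
```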
We chose to pursue the Jim Gray benchmark because it is maintained by a third-party committee. This ensures that it was independently validated and based on a set of well-defined industry rules. Developer skepticism about benchmarks is warranted: self-reported, unverified benchmarks are often more marketing material than anything else. The beauty of open source is that users can try things out for themselves at little or no cost. I always encourage users to spin up Databricks Cloud or download Spark and evaluate it with their own data, rather than focusing too much on benchmarks.
It’s also important for users to think holistically about performance. If your data spends 6 hours in an ETL pipeline to get it into just the right format, or requires a 3-month effort to accommodate a schema change, is it really a win if the query time is marginally faster? If you need to transfer your data into another system to perform machine learning, is that worth a 10% performance improvement? Data is typically messy and complex, and end-to-end pipelines involve different computation models, such as querying, machine learning, and ETL. Spark’s goal is to make working with complex data in real-world pipelines just plain simple!
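To make the end-to-end point concrete, here is a minimal sketch (my own example, assuming Spark 1.x with MLlib and hypothetical table columns) of ETL, SQL querying, and machine learning living in a single program, with no export to another system:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch"))
    val sqlContext = new SQLContext(sc)

    // ETL: load Parquet produced by an earlier step (placeholder path and columns).
    val users = sqlContext.read.parquet("hdfs:///warehouse/users_parquet")
    users.registerTempTable("users")

    // Query: aggregate and clean with SQL in the same program.
    val features = sqlContext.sql(
      "SELECT CAST(age AS DOUBLE) AS age, CAST(visits AS DOUBLE) AS visits " +
      "FROM users WHERE age IS NOT NULL")

    // ML: hand the query result straight to MLlib.
    val points = features.map(r => Vectors.dense(r.getDouble(0), r.getDouble(1)))
    val model = KMeans.train(points, 3, 20) // k = 3 clusters, 20 iterations
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```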
[Thoughts]
This was my first time translating something into Korean, and it was much harder than just reading. The hardest part was finding Korean words that match the meaning of the English terms. They say starting is half the battle... now that I've done one, the second should be a little easier.