Using IPython Notebook with Spark

Published by: Linux培训
Source: Linux教程
Date: 2016-10-25 14:36

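While searching for useful Spark tips, I came across several articles on configuring IPython notebook with PySpark. IPython notebook is an essential tool for data scientists to present scientific and theoretical work interactively, because it combines text with Python code. For many data scientists, IPython notebook is their introduction to Python, and it is very widely used, so I thought it was worth covering in this article.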

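Most of the instructions here are adapted from "IPython notebook: Setting up IPython with PySpark". However, we will focus on connecting the IPython shell to PySpark in standalone mode on a local machine, rather than on an EC2 cluster.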

1. Create an IPython notebook profile for Spark

~$ ipython profile create spark
[ProfileCreate] Generating default config file: u'$HOME/.ipython/profile_spark/ipython_config.py'
[ProfileCreate] Generating default config file: u'$HOME/.ipython/profile_spark/ipython_notebook_config.py'
[ProfileCreate] Generating default config file: u'$HOME/.ipython/profile_spark/ipython_nbconvert_config.py'

Note where the profile was created, and substitute the corresponding paths in the steps below:

2. Create the file $HOME/.ipython/profile_spark/startup/00-pyspark-setup.py and add the following code:

 import os
 import sys
 
 # Configure the environment
 if 'SPARK_HOME' not in os.environ:
     os.environ['SPARK_HOME'] = '/srv/spark'
 
 # Create a variable for our root path
 SPARK_HOME = os.environ['SPARK_HOME']
 
 # Add the PySpark/py4j to the Python Path
 sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
 sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
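The startup script above adds two fixed directories to the Python path. As a minimal sketch of a more defensive variant, the helper below only adds entries that actually exist on disk. The function names `pyspark_path_entries` and `add_pyspark_to_path` are hypothetical, and the `python/lib/py4j-*-src.zip` location is an assumption about how some Spark releases bundle py4j, so check your own installation's layout before relying on it.

```python
import glob
import os
import sys


def pyspark_path_entries(spark_home):
    """Return the existing paths under spark_home that belong on sys.path."""
    candidates = [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "build"),
    ]
    # Assumption: some Spark releases ship py4j as a zip under python/lib.
    candidates.extend(sorted(glob.glob(
        os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))))
    # Keep only the candidates that are really present on disk.
    return [path for path in candidates if os.path.exists(path)]


def add_pyspark_to_path(spark_home):
    """Prepend the existing PySpark entries to sys.path and return them."""
    entries = pyspark_path_entries(spark_home)
    # Insert in reverse so the first entry ends up first on sys.path.
    for entry in reversed(entries):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return entries
```

Calling `add_pyspark_to_path(os.environ['SPARK_HOME'])` from the startup file would then take the place of the two `sys.path.insert` lines; an empty return value is a hint that SPARK_HOME points at the wrong directory.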

3. Start IPython notebook with the profile we just created.

~$ ipython notebook --profile spark

4. In the notebook, you should be able to see the variable we just created.

print SPARK_HOME

5. At the top of the IPython notebook, make sure you add the Spark context.

from pyspark import SparkContext
sc = SparkContext('local', 'pyspark')
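In a notebook it is easy to re-run the cell that creates the context, and Spark refuses to start a second SparkContext in the same process. One possible way around this, sketched below, is to reuse an existing context when there is one. The helper name `get_spark_context` is hypothetical, and the sketch assumes a PySpark release that provides `SparkContext.getOrCreate`; older releases may not have it.

```python
def get_spark_context(master="local", app_name="pyspark"):
    """Return the active SparkContext, creating one if none exists yet."""
    # Imported lazily so the function can be defined before PySpark is importable.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    # Assumption: SparkContext.getOrCreate is available in this PySpark version.
    return SparkContext.getOrCreate(conf)
```

With this in place, `sc = get_spark_context()` can be run repeatedly in the same notebook session.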

6. Test the Spark context by running a simple computation in IPython.

def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up the square root of n
    # for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# Create an RDD of numbers from 0 to 999,999
nums = sc.parallelize(xrange(1000000))

# Compute the number of primes in the RDD
print nums.filter(isprime).count()

If you get a number back and no errors occur, your context is working correctly!
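To check that the number is also the right one, the count can be reproduced without Spark. The sketch below uses a sieve of Eratosthenes in plain Python, which is a different technique from the trial division in `isprime`; the function name `count_primes_below` is hypothetical. There are 78,498 primes below one million, so both the sieve and the Spark job should report that value.

```python
def count_primes_below(limit):
    """Count the primes p with 0 <= p < limit using a sieve of Eratosthenes."""
    if limit < 2:
        return 0
    # sieve[i] == 1 means i is still considered prime.
    sieve = bytearray([1]) * limit
    sieve[0] = sieve[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            # Cross out every multiple of i, starting at i * i.
            multiples = len(range(i * i, limit, i))
            sieve[i * i::i] = bytearray(multiples)
    return sum(sieve)


print(count_primes_below(1000000))
```

The sieve covers the same range as the RDD (0 up to, but not including, 1,000,000), so the two results are directly comparable.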

Editor's note: The steps above configure an IPython context for using IPython notebook directly with PySpark. However, you can also launch a notebook directly from PySpark as follows:

$ IPYTHON_OPTS="notebook --pylab inline" pyspark

Which method works better depends on how you use PySpark and IPython. The former makes it easier to connect to a cluster from an IPython notebook, so it is the method I prefer.