Monthly Archives: September 2018

Python, Boto and AWS EC2

Python, Tutorials

Most software companies, if not all, have adopted cloud infrastructure and services, and AWS is particularly popular. This post collects a few examples of using boto with one of the services available on AWS, namely EC2. Sooner or later you will probably need a way to programmatically start a few instances, shut them down, filter them and send them remote commands, to say the least.

Filter instances based on tag names from the AWS inventory

EC2 instances on AWS can carry multiple key: value tags, which are useful for identifying an instance or a set of instances. Tags also help when an instance is shut down and booted again frequently and you haven’t attached an Elastic IP, because its public IP address is then bound to change. You could filter an instance by its private IP instead, but that isn’t very effective when you have a lot of instances (>100).

Boto2
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1', aws_access_key_id='aws_access_id', aws_secret_access_key='aws_secret')
# tag filters use the form 'tag:<key>': '<value>'
reservations = conn.get_all_instances(filters={'tag:purpose': 'intelligence'})
public_ips = [each_instance.ip_address for r in reservations for each_instance in r.instances]
# use each_instance.private_ip_address to get the private IP address of the instance
Boto3
import boto3
session = boto3.session.Session(aws_access_key_id='aws_access_id',
                                aws_secret_access_key='aws_secret',
                                region_name='us-east-1')

ec2 = session.resource('ec2')
# select instances whose 'purpose' tag equals 'intelligence'
instances = ec2.instances.filter(
    Filters=[{'Name': 'tag:purpose', 'Values': ['intelligence']}]
)
public_ips = [each_instance.public_ip_address for each_instance in instances]
# use each_instance.private_ip_address to get the private IP address of the instance
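
When several tags are involved, writing the Filters list by hand gets repetitive. The sketch below is one possible helper, not part of boto: the name build_tag_filters and its dictionary input are assumptions made for illustration. It only builds the list in the shape the Boto3 call above expects, so it runs without AWS credentials. EC2 combines multiple filters with AND and multiple values inside one filter with OR.

```python
def build_tag_filters(tags):
    """Turn {'purpose': 'intelligence'} into a Boto3-style Filters list.

    `tags` maps a tag key to a single value or to a list of values.
    This helper is hypothetical; it is not provided by boto or boto3.
    """
    filters = []
    for key, values in sorted(tags.items()):
        # Accept either one value or several values per tag key.
        if isinstance(values, str):
            values = [values]
        filters.append({'Name': 'tag:' + key, 'Values': list(values)})
    return filters


print(build_tag_filters({'purpose': 'intelligence'}))
# [{'Name': 'tag:purpose', 'Values': ['intelligence']}]
```

The result can be passed as ec2.instances.filter(Filters=build_tag_filters({...})).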
Boot/Shutdown an instance/instances from the AWS inventory

Using boto, you can boot/shutdown/terminate instances.

Boto2
def start_stop_terminate_instance(instance_ids, conn, action='start'):
    if action == 'start':
        conn.start_instances(instance_ids=instance_ids)
    elif action == 'stop':
        conn.stop_instances(instance_ids=instance_ids)
    elif action == 'terminate':
        conn.terminate_instances(instance_ids=instance_ids)
Boto3
def start_stop_terminate_instance(instance_ids, conn, action='start'):
    # here conn is the boto3 EC2 resource, e.g. session.resource('ec2')
    if action == 'start':
        conn.instances.filter(InstanceIds=instance_ids).start()
    elif action == 'stop':
        conn.instances.filter(InstanceIds=instance_ids).stop()
    elif action == 'terminate':
        conn.instances.filter(InstanceIds=instance_ids).terminate()
Create Instances based on various metrics

Boto wraps the AWS APIs, which also allow creating instances. An EC2 instance has various properties. The most common is the type of the instance. Types group instances by characteristics such as compute power, performance and bandwidth. Commonly used general-purpose types are t2, m4 and m3; c5, c4 and c3 are compute-optimized instances. For a process or application that leans towards in-memory work, you’d use x1, r4 or r3. There are other types too, but the ones mentioned above are in common use. Other properties of an instance include the instance id, the size (nano, micro, small, large, xlarge, 2xlarge, 4xlarge, 8xlarge, 10xlarge), the key pair used to make a secure connection to the instance, tag names, display names, security groups, attached storage id, etc. Using boto we can create one or more instances based on these parameters.
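
The instance type string passed to the create calls, such as 't2.small', combines the type family and the size described above. As a minimal sketch, assuming the usual "<family>.<size>" form, the hypothetical helper below (split_instance_type is a name introduced here, not a boto function) separates the two parts so they can be checked before a request is made.

```python
def split_instance_type(instance_type):
    """Split an EC2 instance type such as 't2.small' into (family, size).

    Hypothetical helper for illustration; it assumes the '<family>.<size>' form.
    """
    family, dot, size = instance_type.partition('.')
    if not (family and dot and size):
        raise ValueError('expected "<family>.<size>", got %r' % instance_type)
    return family, size


print(split_instance_type('t2.small'))
# ('t2', 'small')
```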

Boto2
import boto.ec2
conn = boto.ec2.connect_to_region('us-east-1', aws_access_key_id='aws_access_id', aws_secret_access_key='aws_secret')
conn.run_instances(
    'ami-ag139jf',
    min_count=10, 
    max_count=100,
    key_name='myKey',
    instance_type='t2.small',
    security_group_ids=['sg-4512']  # 'sg-...' values are group IDs; security_groups expects group names
)
Boto3
import boto3
session = boto3.session.Session(aws_access_key_id='aws_access_id',
                                aws_secret_access_key='aws_secret',
                                region_name='us-east-1')
 
ec2 = session.resource('ec2')
ec2.create_instances(
    ImageId='ami-ag139jf', 
    MinCount=10, 
    MaxCount=100, 
    InstanceType='t2.small',
    KeyName='myKey',
    SecurityGroupIds=['sg-4512']  # 'sg-...' values are group IDs; SecurityGroups expects group names
)
Send remote commands to an EC2 instance

Paramiko can be used to connect to a remote instance, send commands to be executed, and read the standard output/error so you can act accordingly.

import paramiko

# path_to_pem_file and instance_ip are assumed to be defined earlier
key = paramiko.RSAKey.from_private_key_file(path_to_pem_file)
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

# Connect to the instance
try:
    # using username, public ip address and the pem file, create connection to the instance
    client.connect(hostname=instance_ip, username="ubuntu", pkey=key)

    # Execute command remotely.
    stdin, stdout, stderr = client.exec_command("ls -l")
    print(stdout.read().decode())
except Exception as e:
    print(e)
finally:
    # close the connection whether or not the command succeeded
    client.close()
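
To act on the result of a remote command, it helps to collect the exit status together with the standard output and error. The sketch below is one way to do that. The function name run_remote is an assumption introduced here, and the client argument is assumed to be an already connected paramiko.SSHClient like the one above. The function itself does not import paramiko, so it can be exercised with any object that offers the same exec_command interface.

```python
def run_remote(client, command):
    """Run `command` on an already connected SSH client.

    Returns (exit_status, stdout_text, stderr_text).
    `client` is assumed to behave like a connected paramiko.SSHClient.
    """
    stdin, stdout, stderr = client.exec_command(command)
    # Read both streams before asking for the exit status.
    out = stdout.read().decode('utf-8', errors='replace')
    err = stderr.read().decode('utf-8', errors='replace')
    # recv_exit_status() blocks until the remote command has finished.
    status = stdout.channel.recv_exit_status()
    return status, out, err


# Usage, assuming `client` was connected as in the example above:
#     status, out, err = run_remote(client, "ls -l")
#     if status != 0:
#         print("command failed:", err)
```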

Web Scraping using Golang

Golang, Tutorials

Web scraping can be beneficial to individuals and companies. This post collects a set of examples of web scraping using Golang and goquery. I will be using GitHub’s trending page https://github.com/trending throughout this post, mainly because it is well suited to demonstrating various goquery methods. There are two other versions of this article which replicate the same set of examples in Python and NodeJS.

Installation

go get github.com/PuerkitoBio/goquery

Get html of a page
package main
import (
    "log"
    "io"
    "os"
    "net/http"
)

func ScrapeHTML(){
    resp, err := http.Get("https://github.com/trending")
    if err != nil{
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200{
        log.Fatalf("status code error: %d %s", resp.StatusCode, resp.Status)
    }
    io.Copy(os.Stdout, resp.Body)    
}

func main(){
    ScrapeHTML()
}

Using goquery (Golang library) to get the title of a page

package main
import (
    "fmt"
    "log"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)

func ScrapeHTML(){
    resp, err := http.Get("https://github.com/trending")
    if err != nil{
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200{
        log.Fatalf("status code error: %d %s", resp.StatusCode, resp.Status)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(doc.Find("title").Text())
}

func main(){
    ScrapeHTML()
}
Output

$ go run example.go
Trending repositories on GitHub today · GitHub

Using goquery to find a single element by tag name and multiple elements by tag name
package main
import (
    "fmt"
    "log"
    "strings"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)


func scrapeUsingTagNames(){
    resp, err := http.Get("https://github.com/trending")
    if err != nil{
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200{
        log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
    }
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil{
        log.Fatal(err)
    }
    fmt.Println(doc.Find("title").Text())

    doc.Find("ol li").Each(func(i int, s *goquery.Selection){
        fmt.Println(strings.TrimSpace(s.Find("h3").Text()))
    })
}

func main(){
    scrapeUsingTagNames()
}
Output


$ go run example.go
Trending repositories on GitHub today · GitHubAsset 1Asset 1
you-dont-need / You-Dont-Need-Momentjs
ripienaar / free-for-dev
Nozbe / WatermelonDB
cjbarber / ToolsOfTheTrade
byoungd / English-level-up-tips-for-Chinese
TheAlgorithms / Python
thedaviddias / Front-End-Checklist
zziz / pwc
dawnlabs / carbon
CyC2018 / CS-Notes
Avik-Jain / 100-Days-Of-ML-Code
donnemartin / system-design-primer
mariusandra / pigeon-maps
Snailclimb / JavaGuide
JavaNoober / BackgroundLibrary
crossoverJie / JCSprout
Microsoft / nni
PansonPanson / Java-Notes
date-fns / date-fns
sindresorhus / ky
mciastek / sal
rwv / chinese-dos-games
vuejs / vue
GoogleCloudPlatform / open-match
lin-xin / vue-manage-system

Getting Attributes of an element
package main
import (
    "fmt"
    "log"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)

func scrapeAttributes(){
    resp, err := http.Get("https://github.com/trending")
    if err != nil{
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200{
        log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
    }
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil{
        log.Fatal(err)
    }
    fmt.Println(doc.Find("title").Text())

    doc.Find("ol li").Each(func(i int, s *goquery.Selection){
        href, has_attr := s.Find("a").First().Attr("href")
        if has_attr{
            fmt.Println("https://github.com" + href)
        }

    })
}

func main(){
    scrapeAttributes()
}

Output


$ go run example.go
Trending repositories on GitHub today · GitHubAsset 1Asset 1
https://github.com/you-dont-need/You-Dont-Need-Momentjs
https://github.com/ripienaar/free-for-dev
https://github.com/Nozbe/WatermelonDB
https://github.com/cjbarber/ToolsOfTheTrade
https://github.com/byoungd/English-level-up-tips-for-Chinese
https://github.com/TheAlgorithms/Python
https://github.com/thedaviddias/Front-End-Checklist
https://github.com/zziz/pwc
https://github.com/dawnlabs/carbon
https://github.com/CyC2018/CS-Notes
https://github.com/Avik-Jain/100-Days-Of-ML-Code
https://github.com/donnemartin/system-design-primer
https://github.com/mariusandra/pigeon-maps
https://github.com/Snailclimb/JavaGuide
https://github.com/JavaNoober/BackgroundLibrary
https://github.com/crossoverJie/JCSprout
https://github.com/Microsoft/nni
https://github.com/PansonPanson/Java-Notes
https://github.com/date-fns/date-fns
https://github.com/sindresorhus/ky
https://github.com/mciastek/sal
https://github.com/rwv/chinese-dos-games
https://github.com/vuejs/vue
https://github.com/GoogleCloudPlatform/open-match
https://github.com/lin-xin/vue-manage-system

 

Using class name or other attributes to get element
package main
import (
    "fmt"
    "log"
    "strings"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)
func scrapeViaClassName(){
    resp, err := http.Get("https://github.com/trending")
    if err != nil{
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200{
        log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
    }
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil{
        log.Fatal(err)
    }
    fmt.Println(doc.Find("title").Text())

    doc.Find("ol li").Each(func(i int, s *goquery.Selection){
        fmt.Println(strings.TrimSpace(s.Find(".float-sm-right").Text()))
    })
}


func main(){
    scrapeViaClassName()
}
Output


$ go run example.go
Trending repositories on GitHub today · GitHub
625 stars today
476 stars today
407 stars today
392 stars today
332 stars today
316 stars today
304 stars today
274 stars today
249 stars today
201 stars today
206 stars today
188 stars today
192 stars today
165 stars today
154 stars today
141 stars today
153 stars today
146 stars today
153 stars today
149 stars today
145 stars today
134 stars today
124 stars today
137 stars today
117 stars today

Navigate the children of an element
package main
import (
    "fmt"
    "log"
    "strings"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)
func navigateChildrens(){
    resp, err := http.Get("https://github.com/trending")
    if err != nil{
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != 200{
        log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
    }
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil{
        log.Fatal(err)
    }
    fmt.Println(doc.Find("title").Text())
    olSelection := doc.Find("ol")
    olSelection.Children().Each(func(i int, s *goquery.Selection){ // using .Children() on the ol selection to get all li
        fmt.Println(strings.TrimSpace(s.Find("h3").Text()))
    })
}


func main(){
    navigateChildrens()
}
Output


$ go run example.go
Trending repositories on GitHub today · GitHub
you-dont-need / You-Dont-Need-Momentjs
ripienaar / free-for-dev
Nozbe / WatermelonDB
cjbarber / ToolsOfTheTrade
byoungd / English-level-up-tips-for-Chinese
TheAlgorithms / Python
thedaviddias / Front-End-Checklist
zziz / pwc
dawnlabs / carbon
CyC2018 / CS-Notes
Avik-Jain / 100-Days-Of-ML-Code
donnemartin / system-design-primer
mariusandra / pigeon-maps
Snailclimb / JavaGuide
JavaNoober / BackgroundLibrary
crossoverJie / JCSprout
Microsoft / nni
PansonPanson / Java-Notes
date-fns / date-fns
sindresorhus / ky
mciastek / sal
rwv / chinese-dos-games
vuejs / vue
GoogleCloudPlatform / open-match
lin-xin / vue-manage-system

Children() returns only the immediate children of the parent element.

Navigating previous and next siblings of an element
package main
import (
    "fmt"
    "log"
    "strings"
   "net/http"
    "github.com/PuerkitoBio/goquery"
)

func navigateSiblings(){
    resp, err := http.Get("https://github.com/trending")
    if err != nil{
        log.Fatal(err)
    }
    defer resp.Body.Close()
    if resp.StatusCode != 200{
        log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
    }
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil{
        log.Fatal(err)
    }
    fmt.Println(doc.Find("title").Text())
    liSelection := doc.Find("ol li")
    fifthElement := liSelection.Eq(4) // using Eq() and passing the index we can navigate to the element with given index
    fmt.Println(strings.TrimSpace(fifthElement.Find("h3").Text()))
    fourthElement := fifthElement.Prev()
    fmt.Println(strings.TrimSpace(fourthElement.Find("h3").Text()))
    sixthElement := fifthElement.Next()
    fmt.Println(strings.TrimSpace(sixthElement.Find("h3").Text()))
}


func main(){
    navigateSiblings()
}
Output


$ go run example.go
Trending repositories on GitHub today · GitHub
byoungd / English-level-up-tips-for-Chinese
cjbarber / ToolsOfTheTrade
TheAlgorithms / Python

Putting it all together (GitHub trending scraper using Golang)
package main
import (
    "fmt"
    "log"
    "strings"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)

func githubTrendingScraper(){
    resp, err := http.Get("https://github.com/trending")
    if err != nil{
        log.Fatal(err)
    }
    defer resp.Body.Close()
    if resp.StatusCode != 200{
        log.Fatalf("Status code error: %d %s", resp.StatusCode, resp.Status)
    }
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil{
        log.Fatal(err)
    }

    fmt.Println(doc.Find("title").Text())
    doc.Find("ol li").Each(func (i int, s *goquery.Selection){
        repositoryName := strings.TrimSpace(s.Find("h3").Text())
        totalStarsToday := strings.TrimSpace(s.Find(".float-sm-right").Text())
        // fall back to a message when the link has no href, instead of appending it to the host
        url := "No valid url found"
        if href, hasAttr := s.Find("a").Attr("href"); hasAttr {
            url = "https://github.com" + href
        }
        fmt.Println(repositoryName, "\t", totalStarsToday, "\t", url)
    })

}


func main(){
    githubTrendingScraper()
}
Output


$ go run example.go
Trending repositories on GitHub today · GitHub
you-dont-need / You-Dont-Need-Momentjs      625 stars today https://github.com/you-dont-need/You-Dont-Need-Momentjs
ripienaar / free-for-dev                    476 stars today https://github.com/ripienaar/free-for-dev
Nozbe / WatermelonDB                        407 stars today https://github.com/Nozbe/WatermelonDB
cjbarber / ToolsOfTheTrade                  392 stars today https://github.com/cjbarber/ToolsOfTheTrade
byoungd / English-level-up-tips-for-Chinese 332 stars today https://github.com/byoungd/English-level-up-tips-for-Chinese
TheAlgorithms / Python                      316 stars today https://github.com/TheAlgorithms/Python
thedaviddias / Front-End-Checklist          304 stars today https://github.com/thedaviddias/Front-End-Checklist
zziz / pwc                                  274 stars today https://github.com/zziz/pwc
dawnlabs / carbon                           249 stars today https://github.com/dawnlabs/carbon
CyC2018 / CS-Notes                          201 stars today https://github.com/CyC2018/CS-Notes
Avik-Jain / 100-Days-Of-ML-Code             206 stars today https://github.com/Avik-Jain/100-Days-Of-ML-Code
donnemartin / system-design-primer          188 stars today https://github.com/donnemartin/system-design-primer
mariusandra / pigeon-maps                   192 stars today https://github.com/mariusandra/pigeon-maps
Snailclimb / JavaGuide                      165 stars today https://github.com/Snailclimb/JavaGuide
JavaNoober / BackgroundLibrary              154 stars today https://github.com/JavaNoober/BackgroundLibrary
crossoverJie / JCSprout                     141 stars today https://github.com/crossoverJie/JCSprout
Microsoft / nni                             153 stars today https://github.com/Microsoft/nni
PansonPanson / Java-Notes                   146 stars today https://github.com/PansonPanson/Java-Notes
date-fns / date-fns                         153 stars today https://github.com/date-fns/date-fns
sindresorhus / ky                           149 stars today https://github.com/sindresorhus/ky
mciastek / sal                              145 stars today https://github.com/mciastek/sal
rwv / chinese-dos-games                     134 stars today https://github.com/rwv/chinese-dos-games
vuejs / vue                                 124 stars today https://github.com/vuejs/vue
GoogleCloudPlatform / open-match            137 stars today https://github.com/GoogleCloudPlatform/open-match
lin-xin / vue-manage-system                 117 stars today https://github.com/lin-xin/vue-manage-system