...
Create a new connection profile by selecting the icon in the top left corner of the “Select Connection Profile” window.
Select “Hive JDBC” from the Driver drop-down.
Update the URL section with the Hive URL.
Provide your username and password received in the welcome email. NOTE: You are not required to enter the “@internal.sdc.dot.gov” portion of your username to log on.
Click on the Test button at the bottom to validate your connection. A pop-up dialog will appear confirming a successful or failed connection. If you continue running into a failed connection, contact the SDC support desk for assistance at sdc-support@dot.gov.
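The note about the optional “@internal.sdc.dot.gov” portion of the username can be illustrated with a short Python sketch. The helper name `short_username` and the constant `SDC_DOMAIN_SUFFIX` are hypothetical and are not part of any SDC tooling. The sketch only shows how a script could accept either form of the username.

```python
SDC_DOMAIN_SUFFIX = "@internal.sdc.dot.gov"  # suffix named in the note above


def short_username(username):
    """Return the username without the optional SDC domain suffix (hypothetical helper)."""
    username = username.strip()
    # Compare case-insensitively, but keep the original casing of the name itself.
    if username.lower().endswith(SDC_DOMAIN_SUFFIX):
        return username[:-len(SDC_DOMAIN_SUFFIX)]
    return username
```

Either form of the username then yields the same short name to enter in the connection profile.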
Update Data Formatting Settings in SQL Workbench
Once the connection has been established, navigate to Tools | Options | Data formatting and update the Decimal digits value to 0.
...
Connecting to the SDC Hadoop Data Warehouse Using Python
Important: The default version of Python installed on the SDC Windows Workstations is
v2.7.4. Two Python modules must be installed before connecting to Hadoop/Hive with Python
using the example code below. To install these modules, open a Windows Command Prompt and
enter the following two commands:
C:\Users\username> pip install impyla
C:\Users\username> pip install numpy
The above "pip install …" commands only need to be run ONCE on the SDC Windows
Workstation. Once the Python modules are installed, they remain available, even across reboots
of the workstation.
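Because the installation only needs to happen once, it can be useful to check whether the modules are already present before running pip again. The following is a minimal sketch of such a check. The function name `missing_modules` is hypothetical. Note that the impyla package is imported under the name `impala`, as in the example code in this section.

```python
from __future__ import print_function


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "impala" is the import name provided by the impyla package.
    missing = missing_modules(["impala", "numpy"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```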
To test Python connectivity to the data warehouse, open the IDLE Python editor and execute:
Code Block |
---|
from __future__ import print_function

from impala.dbapi import connect
import numpy

# Open a connection to the data warehouse; replace the bracketed placeholders.
conn = connect(
    host='[host address]',
    port=10000,
    auth_mechanism='PLAIN',
    user='[your_username]',
    password='[your_password]')

# List the tables visible to this user.
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
result = cursor.fetchall()
result = numpy.array(result)

# print(result)
for r in result:
    print(r)
|
This should result in an array of tables displayed to the user.
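If plain table names are more convenient than an array of rows, the same query can be wrapped in a small helper. This is a sketch only. The names `list_tables` and `StubCursor` are hypothetical, and it assumes the table name is the first column of each row returned by SHOW TABLES. The stub cursor exists only so the call pattern can be tried without a live connection. With a real connection, pass `conn.cursor()` instead.

```python
from __future__ import print_function


def list_tables(cursor):
    """Run SHOW TABLES on a DB-API cursor and return the table names as a list."""
    cursor.execute('SHOW TABLES')
    # Assumption: the table name is the first column of every returned row.
    return [row[0] for row in cursor.fetchall()]


class StubCursor(object):
    """Hypothetical stand-in for a real cursor, used only to show the call pattern."""

    def __init__(self, rows):
        self._rows = rows
        self.last_statement = None

    def execute(self, statement):
        # Record the statement instead of sending it to a server.
        self.last_statement = statement

    def fetchall(self):
        return list(self._rows)


print(list_tables(StubCursor([('table_a',), ('table_b',)])))
```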
Connecting to Redshift from Linux Environment
Credentials for the Waze Redshift database are provided by SDC Support (sdc-support@dot.gov).
• In R, it is possible to connect to Redshift using multiple packages. The RPostgreSQL
package provides a simple method. This package requires the PostgreSQL library to be
installed at the system level; if it is not installed, install it as root from
the terminal:
$ sudo yum install postgresql-devel
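• In R, you may need to run install.packages("RPostgreSQL", dependencies = TRUE) if you
do not already have the package installed. The example below loads the RPostgres
package, which can be installed the same way.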
• Connect to Redshift using the following code as a guide:
Code Block |
---|
library(DBI)        # provides dbConnect() and dbGetQuery()
library(RPostgres)

# Specify username and password manually, once:
if (Sys.getenv("sdc_waze_username") == "") {
  cat("Please enter SDC Waze username and password manually, in the console,",
      "the first time accessing the Redshift database, using:\n",
      "Sys.setenv('sdc_waze_username' = <see email from SDC Administrator>)\n",
      "Sys.setenv('sdc_waze_password' = <see email from SDC Administrator>)\n")
}

redshift_host <- "(details provided by SDC Support to registered SDC Redshift Users)"
redshift_port <- "5439"
redshift_user <- Sys.getenv("sdc_waze_username")
redshift_password <- Sys.getenv("sdc_waze_password")
redshift_db <- "dot_sdc_redshift_db"

#drv <- dbDriver("PostgreSQL")
conn <- dbConnect(
  RPostgres::Postgres(),
  host = redshift_host,
  port = redshift_port,
  user = redshift_user,
  password = redshift_password,
  dbname = redshift_db)
|
• A database can then be queried using the dbGetQuery() function.
Accessing Jupyter Notebook and RStudio Server
Linux users can access their Jupyter Notebook and RStudio Server using the Firefox web
browser on their Windows workstation at the URLs below.
• RStudio – http://<username>-workspace.internal.sdc.dot.gov:8787/
• Jupyter Notebook – http://<username>-workspace.internal.sdc.dot.gov:8888/
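The two URLs follow the same pattern and differ only in the port. As a small illustration, the sketch below fills in the `<username>` placeholder for both services. The function name `workspace_urls` is hypothetical, and the username must be the one assigned to the workspace.

```python
from __future__ import print_function


def workspace_urls(username):
    """Build the RStudio and Jupyter Notebook URLs for a workspace username."""
    host = "{0}-workspace.internal.sdc.dot.gov".format(username)
    return {
        "rstudio": "http://{0}:8787/".format(host),   # RStudio Server port
        "jupyter": "http://{0}:8888/".format(host),   # Jupyter Notebook port
    }


# "jdoe" is a made-up username used only for illustration.
for name, url in sorted(workspace_urls("jdoe").items()):
    print(name, url)
```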
Windows users can click on the “RStudio” shortcut icon on the desktop to open
the RStudio console.
Manage Workstations
After launching their workstations, users can resize CPU/RAM and schedule uptime for a workstation by clicking on its Manage button as shown below.
...
A dialogue window appears with two checkbox options:
...
Selecting each option renders the appropriate tabs in the dialogue window. The icon shown next to each option provides an informational tooltip on its function.
Resize Workstation
1. To resize the workstation, select the checkbox for Resize Workstation and then Next to continue.
2. A message is shown at the bottom of the screen indicating that the workstation will be stopped before applying the resize.
3. The Resize Workstation tab allows users to select desired CPU/RAM for their
workstation. Current configurations will be grayed out and unavailable. Users can also
explore pricing details using the link provided under “click here.”
4. Select the “Please start my workstation after resizing to the new configuration” checkbox
to automatically start the workstation with the new configuration after saving changes.
5. Select Submit after all details are entered.
6. A Recommended List of instances will appear. Select the desired instance and then the
Next button.
...
7. On the Schedule Date tab, users are prompted to enter a date range for how long the
resize should last for the workstation instance. Enter the From and To dates and then
select Submit.
...
8. Users will be returned to the Workstations tab with updated CPU and memory information. They will also receive a success email message from the system confirming the resize expiration date.
Schedule/Extend Uptime
1. By default, all workstations are shut down at 11 PM EST. If you want to schedule your
workstations to be up for a longer period to accommodate analysis runs, select the
checkbox for Schedule Workstation Uptime and then Next to continue.
2. The Schedule Workstation Uptime tab allows users to enter a date range for how long the
workstation uptime should last to skip shutdown. Enter the From and To dates and then
select Submit.
3. To extend any currently scheduled uptime for the workstation, select the Workstations
tab and then select Manage again for the workstation. A new tooltip is now shown for the
Schedule Workstation Uptime checkbox on mouse hover that indicates previously
scheduled uptime.
4. Repeat steps 1-2. For step 2, the From date will already include the date from the
previously scheduled uptime. Add a new To date later in the calendar and then submit the
update. The previously scheduled uptime goes inactive while the new one becomes
active.
5. After selecting Submit, return to the Workstations tab and then select Manage for the
workstation. The tooltip shown on hover for the Schedule Workstation Uptime checkbox
now displays the extended uptime.
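To judge whether an analysis run needs scheduled uptime at all, it can help to compare its expected end time with the default shutdown. The sketch below shows one way to do that. The function names are hypothetical. It assumes all times are naive datetimes expressed in the same time zone as the shutdown schedule, and that the shutdown happens exactly at 11 PM.

```python
from __future__ import print_function

from datetime import datetime, timedelta

SHUTDOWN_HOUR = 23  # 11 PM, the default daily shutdown described in this section


def next_shutdown(now):
    """Return the first default shutdown at or after `now` (schedule-local time)."""
    candidate = now.replace(hour=SHUTDOWN_HOUR, minute=0, second=0, microsecond=0)
    if candidate < now:
        # Today's shutdown has already passed, so the next one is tomorrow.
        candidate += timedelta(days=1)
    return candidate


def needs_scheduled_uptime(start, run_hours):
    """Return True if a run of `run_hours` starting at `start` would outlast the next shutdown."""
    return start + timedelta(hours=run_hours) > next_shutdown(start)


# Illustrative values only: a run started at 8 PM that is expected to take 6 hours.
start = datetime(2024, 1, 15, 20, 0)
print(needs_scheduled_uptime(start, 6))
```

If the check returns True, the To date entered on the Schedule Workstation Uptime tab should cover the expected end of the run.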
Stop Workstations
Users can see their assigned workstations by clicking on the Workstations tab in the top right corner of the page. By default, all workstations are scheduled to stop every day at 11 PM EST. Users can stop a workstation manually by clicking on the Stop button as shown below. A message will appear when the instance is successfully stopped.
...